Lock-free proxy algorithm

Dmitriy Vyukov

unread,

Apr 22, 2007, 6:04:20 AM4/22/07

to dvy...@gmail.com

I have the algorithm (and the implementation) of lock-free object
proxy.
What I mean by "object proxy" - object proxy holds some user object
and provides "read" access to it, and provides "replace object"
operation.
It is something that solves problem that reader_writer_mutex solves.
It is analog of http://home.comcast.net/~vzoom/demos/pc_sample.c
But it has some superior properties (of course if algorithm is
working :) You (I) never can be sure with lock-free algorithm :) ).

Properties of the algorithm:
- 64bit clean - no DCAS
- all access (read and write) is _wait-free_ - i.e. no
InterlockedCompareExchage, and no cycles, just InterlockedExchageAdd
and InterlockedExchage
- have the limit of maximum number of concurrent reading threads -
can be varied by changing "number of stolen bits from pointer" - so
can be set to 32/64/128 - I think this is enough... for now...
- can work in two modes: normal and deffered release
- in normal mode user object released immediatly when last reader/
writer stops using it, read access costs 2 interlocked operations - 1
for acquire and 1 for release
- in deffered release mode user object release can be deffered by
inactive reader, read access costs _zero_ interlocked operations
(amortized)
- no collector back-list of "deleted" objects - straitforward
determinate object deletion

// Usage Example in normal mode (not deffered)
typedef fire::store<int, 6, false> store;
// int - type of user object
// 6 - number of stolen bits - so 64 maximum concurrent reader threads
// false - normal mode (not deffered)

void reader_thread(void* p)
{
store& s = *static_cast<store*>(p);
while (true)
{
store::scoped_handle h (s);
int const& v = h.ref();
// work with v
}
}

void writer_thread(void* p)
{
store& s = *static_cast<store*>(p);
while (true)
{
int* v = new int(0);
s.reset_object(v);
}
}

int main()
{
store s (new int(1));
_beginthread(reader_thread, 0, &s);
_beginthread(reader_thread, 0, &s);
_beginthread(writer_thread, 0, &s);
Sleep(10000);
}

// Usage Example in deffered mode
typedef fire::store<int> store;

void reader_thread(void* p)
{
store& s = *static_cast<store*>(p);
store::local_mediator m (s); // <------- main difference
while (true)
{
store::scoped_handle h (m); // <---- costs no interlocked operations
and no membars
int const& v = h.ref();
// work with v
}
}

// writer_thread() and main() - the same

Here is the implementation:
http://files.rsdn.ru/38267/fire_store.hpp
http://files.rsdn.ru/38267/fire_store_impl.hpp
It is written for MSVC - but can be easy ported to gcc and other - the
main problem - intrinsic functions _InterlockedExchange and
_InterlockedExchangeAdd - all other - standart C++

I will be very appriciate for review/comments/suggestions/bugs
If it works like expected - I think it is near the ideal for storing
global shared read-mostly objects like program settings and caches and
other...

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 22, 2007, 6:50:42 AM4/22/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1177236260.0...@e65g2000hsc.googlegroups.com...

>I have the algorithm (and the implementation) of lock-free object
> proxy.

[...]

> It is something that solves problem that reader_writer_mutex solves.
> It is analog of http://home.comcast.net/~vzoom/demos/pc_sample.c
> But it has some superior properties (of course if algorithm is
> working :) You (I) never can be sure with lock-free algorithm :) ).

Its nice to know that somebody is actually interested in that "older" PDR
code I posted. Thank you!

http://groups.google.com/group/comp.programming.threads/msg/06a5416eadd3b5eb
(pdr term origins...)

;^)

> Properties of the algorithm:
> - 64bit clean - no DCAS
> - all access (read and write) is _wait-free_ - i.e. no
> InterlockedCompareExchage, and no cycles, just InterlockedExchageAdd
> and InterlockedExchage

Okay.

[...]

> - can work in two modes: normal and deffered release

[...]

Humm. Sounds like you made some fairly interesting "improvements"...

> Here is the implementation:
[...]

> I will be very appriciate for review/comments/suggestions/bugs
> If it works like expected - I think it is near the ideal for storing
> global shared read-mostly objects like program settings and caches and
> other...

Well, first things first...

How comfortable are you with navigating through my 'pc_sample.c' code?

http://groups.google.com/group/comp.arch/browse_frm/thread/b2f751d5dad6c07b
http://groups.google.com/group/comp.lang.c++.moderated/msg/bc1506fb6d0e3ba7
(intros...)

How well do you understand the reference counting semantics that it uses?
Here is a simple explanation of the technique:

http://groups.google.com/group/comp.programming.threads/msg/08af0ba0d985027b
(simple proof...)

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f2c94118046142e8
(interesting thread to say the least!)

How comfortable are you with the usage pattern of PDR in general?

Have you ever build a user-space RCU prototype?

How comfortable are you with assembly language (e.g., x86, SPARC, PowerPC,
ARM, ect...)??

Have you looked at any of Joe Seighs extensive work on PDR:

http://sourceforge.net/project/showfiles.php?group_id=127837
(rcpc, atomic-ptr based collector, and of course, fastsmr...)

Are you familiar with Joes atomic-ptr algorithm?

I am looking forward to reading through your "current" algorithm. We can
definitely help you out with virtually every aspect of its design. Feel free
to ask any questions you want on this topic... You are indeed in the correct
newsgroup for this sort of stuff!

I will be able to get a feel for how advanced you are wrt PDR in general
after I get some answers to some of those questions!

Any thoughts?

:^)

Chris Thomasson

unread,

Apr 22, 2007, 6:53:36 AM4/22/07

to

I think you have a typo in the following file:

http://files.rsdn.ru/38267/fire_store_impl.hpp

look at the very last function at the end of the page:

_________
template<typename type, unsigned stolen_bits, bool deffered_release>
shared_handle<type, stolen_bits, deffered_release>& shared_handle<type,
stolen_bits, deffered_release>::operator = (const shared_handle& other)
{
if (other.ptr_ == ptr_) return *this;
p_.release(ptr_);
ptr_ = other.ptr_;
p.cquire(ptr_);
return *this;
}

}
___________

I think you meant 'p.acquire(ptr_)' instead of 'p.cquire(ptr_)'

;^)

Dmitriy Vyukov

unread,

Apr 22, 2007, 6:58:56 AM4/22/07

to

On 22 апр, 14:04, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Here is the implementation:

It seems the link is broken - so I just post all code:

#pragma once

namespace fire
{

namespace detail
{

template<typename type, unsigned stolen_bits, bool deffered_release>

class scoped_handle;

template<typename type, unsigned stolen_bits, bool deffered_release>

class shared_handle;
template<typename type, unsigned stolen_bits> class stored_type;
template<typename type, unsigned stolen_bits> class local_mediator;
}

/** Store for thread-shared object
* Offers the read access to stored object and replace object
operation
* Read access to object is transactional
* All access is wait-free
* @template type - type of stored object
* @template stolen_bits - defines the maximum number of concurrent
reading threads
* maximum number of concurrent reading threads = 2 ^ stolen_bits
* @template deffered_release - defines object releasing mode
* when deffered_release == false: object released immediatly when
* last reader/writer stops using it
* read access costs 2 interlocked operations - 1 for acquire and 1
for release
* when deffered_release == true: object release can be deffered by
inactive reader
* read access costs 0 interlocked operations (amortized)
*/
template<typename type, unsigned stolen_bits = 6, bool
deffered_release = true>
class store
{
public:
typedef detail::scoped_handle<type, stolen_bits, deffered_release>
scoped_handle;
typedef detail::shared_handle<type, stolen_bits, deffered_release>
shared_handle;

typedef detail::local_mediator<type, stolen_bits> local_mediator;

typedef void(*deleter_t)(type*);

explicit store(type* obj);
store(type* obj, deleter_t deleter);

void reset_object(type* obj);

~store();

private:
typedef detail::stored_type<type, stolen_bits> stored_type;

deleter_t deleter_;

/** Structure of counted_ptr_:
* High stolen_bits bits - usage counter
* (number of reader thread that ever read this object)
* Low (32 - stolen_bits) bits - pointer to stored_type object
* (that holds pointer to user object), low stolen_bits bits of
pointer
* omitted (they are zeroes)
*/
unsigned volatile mutable counted_ptr_;

stored_type volatile* get_stored() const;
stored_type volatile* get_stored(stored_type volatile*& cache) const;

static void acquire(stored_type volatile* ptr);
static void release(stored_type volatile* ptr);

store(store const&);
store& operator = (store const&);

friend class scoped_handle;
friend class shared_handle;
friend class local_mediator;
};

namespace detail
{

template<typename type, unsigned stolen_bits, bool deffered_release>

struct provider
{
typedef local_mediator<type, stolen_bits> type;
};

template<typename type, unsigned stolen_bits>
struct provider<type, stolen_bits, false>
{
typedef store<type, stolen_bits, false> type;
};

template<typename type, unsigned stolen_bits, bool deffered_release>

class scoped_handle
{
public:
typedef typename provider<type, stolen_bits, deffered_release>::type
provider;

explicit scoped_handle(provider const& p);
~scoped_handle();

type const* operator -> () const;
type const& operator*() const;
type const* ptr() const;
type const& ref() const;

protected:
stored_type<type, stolen_bits> volatile* ptr_;
provider const& p_;

scoped_handle(scoped_handle const& other);
scoped_handle& operator = (scoped_handle const& other);
};

template<typename type, unsigned stolen_bits, bool deffered_release>

class shared_handle : public scoped_handle<type, stolen_bits,
deffered_release>
{
public:
typedef typename scoped_handle<type, stolen_bits,
deffered_release>::provider provider;

explicit shared_handle(provider const& p);
shared_handle(shared_handle const& other);
shared_handle& operator = (shared_handle const& other);
};

template<typename type, unsigned stolen_bits>
class local_mediator
{
public:
typedef store<type, stolen_bits, true> store;
typedef stored_type<type, stolen_bits> stored_type;

local_mediator(store const& s);
~local_mediator();

void flush();

private:
store const& s_;
mutable stored_type volatile* cache_;
mutable unsigned counter_;

stored_type volatile* get_stored() const;
void acquire(stored_type volatile* ptr) const;
void release(stored_type volatile* ptr) const;

local_mediator(local_mediator const&);
local_mediator& operator = (local_mediator const&);

friend class scoped_handle<type, stolen_bits, true>;
friend class shared_handle<type, stolen_bits, true>;
};

}}

#include <cassert>

extern "C" long _InterlockedExchange(long volatile* Target, long
Value);
extern "C" long _InterlockedExchangeAdd(long volatile* Addend, long
Value);
#pragma intrinsic (_InterlockedExchange)
#pragma intrinsic (_InterlockedExchangeAdd)

namespace fire
{

template<typename type, unsigned stolen_bits, bool deffered_release>

store<type, stolen_bits, deffered_release>::store(type* obj)
: deleter_(&stored_type::delete_object)
, counted_ptr_(stored_type::create(obj, deleter_))
{
assert(obj);
}

template<typename type, unsigned stolen_bits, bool deffered_release>

store<type, stolen_bits, deffered_release>::store(type* obj, deleter_t
deleter)
: deleter_(deleter)
, counted_ptr_(stored_type::create(obj, deleter_))
{
}

template<typename type, unsigned stolen_bits, bool deffered_release>

store<type, stolen_bits, deffered_release>::~store()
{
unsigned const dest = _InterlockedExchange((long
volatile*)&counted_ptr_, 0u);
stored_type::release(dest);
}

template<typename type, unsigned stolen_bits, bool deffered_release>

typename store<type, stolen_bits, deffered_release>::stored_type
volatile*
store<type, stolen_bits, deffered_release>::get_stored() const
{
unsigned const ret = _InterlockedExchangeAdd((long
volatile*)&counted_ptr_, (1u << stored_type::remain_bits));
return stored_type::outer_extract_pointer(ret);
}

template<typename type, unsigned stolen_bits, bool deffered_release>

typename store<type, stolen_bits, deffered_release>::stored_type
volatile*
store<type, stolen_bits, deffered_release>::get_stored(stored_type
volatile*& cache) const
{
if (!cache) return cache = get_stored();
stored_type volatile* const s =
stored_type::outer_extract_pointer(counted_ptr_);
if (s == cache) return cache;
cache->decrement();
return cache = get_stored();
}

template<typename type, unsigned stolen_bits, bool deffered_release>

void store<type, stolen_bits, deffered_release>::reset_object(type*
obj)
{
assert(obj);
unsigned dest = stored_type::create(obj, deleter_); // создали новый
хранимый объект
dest = _InterlockedExchange((long volatile*)&counted_ptr_, dest); //
Заменяем хранимый объект на новый
stored_type::release(dest);
}

template<typename type, unsigned stolen_bits, bool deffered_release>

void store<type, stolen_bits, deffered_release>::acquire(stored_type
volatile* ptr)
{
ptr->increment();
}

template<typename type, unsigned stolen_bits, bool deffered_release>

void store<type, stolen_bits, deffered_release>::release(stored_type
volatile* ptr)
{
ptr->decrement();
}

namespace detail
{

template<typename type, unsigned stolen_bits>
class stored_type
{
public:
static unsigned const remain_bits = 32 - stolen_bits;
static unsigned const alignment = 1 << stolen_bits;

/** User object */
type* obj_;

char* raw_mem_;

/** Structure of count_removed_:
* LSB - flag means that this object is already removed from store
* All other bits -
* if (LSB == 0) - -number of reader threads, that already release
this object
* if (LSB == 1) - number of reader threads, that still holds this
object
* When this objects is replaced from store, writer thread sets LSB =
1
* and transfers counter from store::counted_ptr_ to
stored_type::count_removed_
*/
unsigned volatile count_removed_;

typedef void(*deleter_t)(type*);

deleter_t deleter_;

stored_type(type* obj, char* raw_mem, deleter_t deleter)
: obj_(obj)
, raw_mem_(raw_mem)
, count_removed_()
, deleter_(deleter)
{}

static unsigned create(type* obj, deleter_t deleter)
{
try
{
char* p = new char[sizeof(stored_type) + alignment];
char* pa = ((int)p % alignment) ? (p - ((int)p % alignment) +
alignment) : p;
new(pa) stored_type(obj, p, deleter);
return (unsigned)pa >> stolen_bits;
}
catch (...)
{
// release the user object
deleter(obj);
throw;
}
}

static type* create_object(type const& obj)
{
return new type(obj);
}

static void delete_object(type* obj)
{
delete obj;
}

static void release(unsigned const outer)
{
outer_extract_pointer(outer)->remove(outer_extract_counter(outer));
}

~stored_type()
{
// release the user object
deleter_(obj_);
}

type* get() volatile
{
return obj_;
}

void decrement() volatile
{
check(_InterlockedExchangeAdd((long volatile*)&count_removed_,
(unsigned)-2) - 2);
}

void increment() volatile
{
_InterlockedExchangeAdd((long volatile*)&count_removed_, 2u);
}

void remove(unsigned const count) volatile
{
unsigned const val = (count << 1) + 1;
check(_InterlockedExchangeAdd((long volatile*)&count_removed_, val)
+ val);
}

void check(unsigned const value) volatile
{
if (1 == value)
{
char* p = this->raw_mem_;
this->~stored_type();
delete[] p;
}
}

static stored_type volatile* outer_extract_pointer(unsigned x)
{
return (stored_type*)(x << stolen_bits);
}

static unsigned outer_extract_counter(unsigned x)
{
return (x >> remain_bits);
}

static unsigned inner_extract_counter(unsigned x)
{
return (x >> 1);
}

static bool inner_extract_flag(unsigned x)
{
return 1 == (x & 1);
}

private:
stored_type(stored_type const&);
stored_type& operator = (stored_type const&);

/** Statically check that (stolen_bits >= 1 && stolen_bits <= 16) */
typedef char static_assert_stolen_bits[stolen_bits >= 1 &&
stolen_bits <= 16 ? 1 : -1];
};

template<typename type, unsigned stolen_bits>
local_mediator<type, stolen_bits>::local_mediator(store const& s)
: s_(s)
, cache_()
, counter_()
{
}

template<typename type, unsigned stolen_bits>
local_mediator<type, stolen_bits>::~local_mediator()
{
assert(!counter_);
flush();
}

template<typename type, unsigned stolen_bits>
void local_mediator<type, stolen_bits>::flush()
{
if (!cache_ || counter_) return;
cache_->decrement();
cache_ = 0;
}

template<typename type, unsigned stolen_bits>
typename local_mediator<type, stolen_bits>::stored_type volatile*
local_mediator<type, stolen_bits>::get_stored() const
{
if (counter_++)
{
assert(cache_);
return cache_;
}
return s_.get_stored(cache_);
}

template<typename type, unsigned stolen_bits>
void local_mediator<type, stolen_bits>::acquire(stored_type volatile*
ptr) const
{
assert(ptr == cache_);
++counter_;
}

template<typename type, unsigned stolen_bits>
void local_mediator<type, stolen_bits>::release(stored_type volatile*
ptr) const
{
assert(counter_);
--counter_;
}

template<typename type, unsigned stolen_bits, bool deffered_release>

scoped_handle<type, stolen_bits,
deffered_release>::scoped_handle(provider const& p)
: ptr_(p.get_stored())
, p_(p)
{
}

template<typename type, unsigned stolen_bits, bool deffered_release>

scoped_handle<type, stolen_bits, deffered_release>::~scoped_handle()
{
p_.release(ptr_);
}

template<typename type, unsigned stolen_bits, bool deffered_release>

type const* scoped_handle<type, stolen_bits,
deffered_release>::operator -> () const
{
return ptr_->get();
}

template<typename type, unsigned stolen_bits, bool deffered_release>

type const* scoped_handle<type, stolen_bits, deffered_release>::ptr()
const
{
return ptr_->get();
}

template<typename type, unsigned stolen_bits, bool deffered_release>

type const& scoped_handle<type, stolen_bits, deffered_release>::ref()
const
{
return *ptr_->get();
}

template<typename type, unsigned stolen_bits, bool deffered_release>

type const& scoped_handle<type, stolen_bits,
deffered_release>::operator * () const
{
return *ptr_->get();
}

template<typename type, unsigned stolen_bits, bool deffered_release>
shared_handle<type, stolen_bits,

deffered_release>::shared_handle(provider const& p)
: scoped_handle<type, stolen_bits, deffered_release>(p)
{
}

template<typename type, unsigned stolen_bits, bool deffered_release>
shared_handle<type, stolen_bits,

deffered_release>::shared_handle(shared_handle const& other)
{
ptr_ = other.ptr_;
p_.acquire(ptr_);

Dmitriy Vyukov

unread,

Apr 22, 2007, 7:11:51 AM4/22/07

to

On 22 апр, 14:53, "Chris Thomasson" <cris...@comcast.net> wrote:
> I think you have a typo in the following file:
>
>

> I think you meant 'p.acquire(ptr_)' instead of 'p.cquire(ptr_)'
>
> ;^)

... yes... templates...
I have working version based on DCAS and then I change some things -
and I don't yet get the full testing... [shuffle] :)
But I think, main thing - algorithm....

Dmitriy Vyukov

unread,

Apr 22, 2007, 7:38:35 AM4/22/07

to

On 22 апр, 14:50, "Chris Thomasson" <cris...@comcast.net> wrote:

> Its nice to know that somebody is actually interested in that "older" PDR
> code I posted. Thank you!

There are not so much working ones, working without GC, without type-
stable memory, without MCAS, without ... :)

> http://groups.google.com/group/comp.programming.threads/msg/06a5416ea...
> (pdr term origins...)

PDR - proxy deffered reclamation?

> Humm. Sounds like you made some fairly interesting "improvements"...

I hope... :)

> How comfortable are you with navigating through my 'pc_sample.c' code?

I am not so good at english (indefinitely better in C++ :) ), but I
hope I understand, what you mean, and I try to answer :)
It's not so easy... like any of lock-free algorithms.
But I read it closely - and I understand the algorithm.
There some thing that helps me - some time ago I with my colleague
developed similar algorithm based on DCAS - the same "outer counter",
the same "inner counter", the same "counter transfer" :) But we have
no back-list of deleted objects.

> How well do you understand the reference counting semantics that it uses?

I think, I do completely...

> How comfortable are you with the usage pattern of PDR in general?

You mean, how it used by library user?
Сomfortable enough... but I use it not with "C-interface", but with "C+
+ -interface" - like in my usage example.

> Have you ever build a user-space RCU prototype?

Can you explain the question more detailed?

> How comfortable are you with assembly language (e.g., x86, SPARC, PowerPC,
> ARM, ect...)??

I read x86 assembly code confidently.
Can write x86, but don't like :) I prefer C++
And I think I can read SPARC, PowerPC, ARM - there are the same
stores, loads, incs, jmps... :)

> Have you looked at any of Joe Seighs extensive work on PDR:
> http://sourceforge.net/project/showfiles.php?group_id=127837
> (rcpc, atomic-ptr based collector, and of course, fastsmr...)
> Are you familiar with Joes atomic-ptr algorithm?

Not yet, but I will. Fast look - I see the same "collector queue" - I
don't have "collector queue" in my algorithm...

> I am looking forward to reading through your "current" algorithm. We can
> definitely help you out with virtually every aspect of its design. Feel free
> to ask any questions you want on this topic... You are indeed in the correct
> newsgroup for this sort of stuff!

The main question - is it works? Didn't I overlook something?

> :^)

:^)

Dmitriy Vyukov

unread,

Apr 22, 2007, 12:44:47 PM4/22/07

to

On 22 апр, 14:50, "Chris Thomasson" <cris...@comcast.net> wrote:

> How well do you understand the reference counting semantics that it uses?
> Here is a simple explanation of the technique:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

> (interesting thread to say the least!)

I've read "Bringing Practical Lock-Free Synchronization to 64-bit
Applications".
And I refer to this paper here:
http://groups.google.com/group/comp.programming.threads/browse_thread/thread/1a3ef233d3717505/2d599e18dc478945?lnk=gst&q=Dmitriy+Vyukov+top+vector&rnum=1#2d599e18dc478945
And you say "...algorithm that uses 5'CAS!!! What the hell was the
author of that paper thinking!" :)

Joe Seigh

unread,

Apr 22, 2007, 6:31:06 PM4/22/07

to

Dmitriy Vyukov wrote:
> On 22 апр, 14:50, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>
>>Its nice to know that somebody is actually interested in that "older" PDR
>>code I posted. Thank you!
>
>
> There are not so much working ones, working without GC, without type-
> stable memory, without MCAS, without ... :)
>
>
>>http://groups.google.com/group/comp.programming.threads/msg/06a5416ea...
>>(pdr term origins...)
>
>
> PDR - proxy deffered reclamation?
>
>

The P stands for PCOW (partial copy on write) which characterizes most
lock-free algorithms which attempt to avoid a full copy on write since
for a lot of data structures that would entail a lot of overhead.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

unread,

Apr 22, 2007, 11:29:21 PM4/22/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1177236260.0...@e65g2000hsc.googlegroups.com...

>I have the algorithm (and the implementation) of lock-free object
> proxy.

[...]

The code is a little tricky to navigate. Could you please post some brief
high-level pseudo-code for the algorithm itself?

Dmitriy Vyukov

unread,

Apr 23, 2007, 12:22:27 AM4/23/07

to

On 23 апр, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

I already think about this...
Ok, I post algorithm in more... traditional form :)
I don't have a lot of time for this activity, but I try to do this
today...

And I've find some bugs in code.

Structure of stored_type::count_removed_ incorrect.
Reference counter must start not from second bit, but from (32 -
stolen_bits) bit - like in store::counted_ptr_

And scoped_handle doesn't define copy ctor operator =, there must be:
scoped_handle(scoped_handle const& other) {}
scoped_handle& operator = (scoped_handle const& other) {return
this;}

I fix this bugs in high-level pseudo-code

Dmitriy Vyukov

unread,

Apr 23, 2007, 7:42:58 AM4/23/07

to

On Apr 23, 8:22 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> On 23 ÁÐÒ, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > The code is a little tricky to navigate. Could you please post some brief
> > high-level pseudo-code for the algorithm itself?
>
> I already think about this...
> Ok, I post algorithm in more... traditional form :)

Here is the code for normal work mode (not deffered, derrered mode I
post later).
I hope, this version is more straightforward...

#pragma once

#include <windows.h>
#include <process.h>
#include <stdio.h>

extern "C" long _InterlockedExchange(long*, long);
extern "C" long _InterlockedExchangeAdd(long*, long);

#pragma intrinsic (_InterlockedExchange)
#pragma intrinsic (_InterlockedExchangeAdd)

#define WORD_WIDTH 32
#define STOLEN_BITS 6
#define REMAIN_BITS (WORD_WIDTH - STOLEN_BITS)
#define ALIGNMENT (1 << STOLEN_BITS)
#define TYPE my_data

// test tracer - used as user data
struct my_data
{
int i;

my_data(int i) : i(i)
{
printf("ctor %d\n", i);
Sleep(0);
}

~my_data()
{
printf("dtor %d\n", i);
Sleep(0);
}
};

union inner_union
{
// target for interlocked instructions
long whole;
struct
{
// indicator that this holder object
// was removed from store
unsigned removed : 1;

// not used
unsigned padding : REMAIN_BITS - 1;

// number of readers, that release the object
unsigned counter : STOLEN_BITS;
};
};

union outer_union
{
// target for interlocked instructions
long whole;
struct
{
// pointer to holder instance
// w/o STOLEN_BITS least significant bits
// and shifted right by STOLEN_BITS
unsigned ptr : REMAIN_BITS;

// number of readers, that acquire the object
unsigned counter : STOLEN_BITS;
};
};

struct holder
{
// stored user object
TYPE* user_obj;

// pointer to unaligned buffer that was returned from malloc()
char* buf_ptr;

// remove flag + release counter
inner_union inner;
};

class store
{
// pointer to holder + acquire counter
outer_union outer;

static outer_union create_holder(TYPE* obj)
{
char* p = new char[sizeof(holder) + ALIGNMENT];
char* pa = ((int)p % ALIGNMENT) ? (p - ((int)p % ALIGNMENT) +
ALIGNMENT) : p;
holder* h = (holder*)pa;
h->user_obj = obj;
h->buf_ptr = p;
h->inner.whole = 0;
unsigned ptr = ((unsigned)pa >> STOLEN_BITS);
outer_union outer;
outer.ptr = ptr;
outer.counter = 0;
return outer;
}

static void replace_holder(outer_union outer)
{
holder* h = (holder*)(outer.ptr << STOLEN_BITS);
inner_union addend;
addend.removed = 1;
addend.counter = outer.counter;
addend.padding = 0;
release_holder(h, addend);
}

static void release_holder(holder* h, inner_union addend)
{
inner_union new_inner;
new_inner.whole =
(_InterlockedExchangeAdd(&h->inner.whole, addend.whole) +
addend.whole);
if (new_inner.removed == 1 && new_inner.counter == 0)
{
delete h->user_obj;
delete[] h->buf_ptr;
}
}

public:
store(TYPE* obj)
{
outer = create_holder(obj);
}

void replace_obj(TYPE* obj)
{
outer_union new_outer = create_holder(obj);
outer_union old_outer;
old_outer.whole =
_InterlockedExchange(&outer.whole, new_outer.whole);
replace_holder(old_outer);
}

holder* acquire()
{
outer_union addend;
addend.counter = 1;
addend.ptr = 0;
outer_union old_whole;
old_whole.whole =
_InterlockedExchangeAdd(&outer.whole, addend.whole);
holder* h = (holder*)(old_whole.ptr << STOLEN_BITS);
return h;
}

static void release(holder* h)
{
inner_union addend;
addend.counter = -1;
addend.removed = 0;
addend.padding = 0;
release_holder(h, addend);
}

static void add_ref(holder* h)
{
inner_union addend;
addend.counter = 1;
addend.removed = 0;
addend.padding = 0;
_InterlockedExchangeAdd(&h->inner.whole, addend.whole);
}

~store()
{
outer_union old_outer;
old_outer.whole =
_InterlockedExchange(&outer.whole, 0);
replace_holder(old_outer);
}
};

// And here is the test:

long dead_counter = 0;

void reader_thread(void* p)
{
store& s = *(store*)p;
for (int i = 0; i < 2; ++i)
{
holder* h = s.acquire();
// here is our object
my_data* data = h->user_obj;
// work with data
printf("read %d\n", data->i);
Sleep(0);
// get one more ref
store::add_ref(h);
// ...
store::release(h);
store::release(h);
}
_InterlockedExchangeAdd(&dead_counter, 1);
}

void writer_thread(void* p)
{
store& s = *(store*)p;
for (int i = 1; i < 3; ++i)
{
printf("repl %d\n", i);
s.replace_obj(new my_data(i));
Sleep(0);
}
_InterlockedExchangeAdd(&dead_counter, 1);
}

int main()
{
{
store s (new my_data(0));

_beginthread(reader_thread, 0, &s);
_beginthread(writer_thread, 0, &s);
_beginthread(reader_thread, 0, &s);
_beginthread(writer_thread, 0, &s);

do {Sleep(1000);} while (dead_counter != 4);
}

printf("dead\n");
}

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 23, 2007, 8:18:24 AM4/23/07

to

On Apr 23, 3:42 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 23, 8:22 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On 23 ÁÐÒ, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > > The code is a little tricky to navigate. Could you please post some brief
> > > high-level pseudo-code for the algorithm itself?
>
> > I already think about this...
> > Ok, I post algorithm in more... traditional form :)
>
> Here is the code for normal work mode (not deffered, derrered mode I
> post later).
> I hope, this version is more straightforward...

And here is the code for deferred mode. I'll post only additions to
the code.

Here is local_mediator class definition:

// thread-local mediator for deferred release
struct local_mediator
{
holder* cache;
unsigned counter;

local_mediator()
: cache()
, counter()
{}
};

And here is new public methods of store class:

holder* acquire(local_mediator* m)
{
if (m->counter++)
return m->cache;
if (m->cache)

{
holder* h = (holder*)(outer.ptr << STOLEN_BITS);

if (m->cache == h)
return m->cache;
release(m->cache);
}
m->cache = acquire();
return m->cache;
}

static void release(local_mediator* m, holder* h)
{
m->counter--;
}

static void add_ref(local_mediator* m, holder* h)
{
m->counter++;
}

static void flush(local_mediator* m)
{
if (m->cache && !m->counter)
{
release(m->cache);
m->cache = 0;
}
}

And here is new test thread:

void deferred_reader_thread(void* p)
{
store& s = *(store*)p;
local_mediator m;

for (int i = 0; i < 2; ++i)
{

holder* h = s.acquire(&m);

// here is our object
my_data* data = h->user_obj;
// work with data
printf("read %d\n", data->i);
Sleep(0);
// get one more ref

store::add_ref(&m, h);
// ...
store::release(&m, h);
store::release(&m, h);
}
store::flush(&m);
_InterlockedExchangeAdd(&dead_counter, 1);
}

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 23, 2007, 8:24:18 AM4/23/07

to

> > > On 23 ÁÐÒ, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> > > > The code is a little tricky to navigate. Could you please post some brief
> > > > high-level pseudo-code for the algorithm itself?

Chris, please review this new version of code.
I can make some additional comments, but I hope this version is clean
enough...

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 25, 2007, 8:58:53 AM4/25/07

to

On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> We can
> definitely help you out with virtually every aspect of its design. Feel free
> to ask any questions you want on this topic... You are indeed in the correct
> newsgroup for this sort of stuff!

Are here any lock-free specialists? If not here, than where? :)
Chris, where are you? :)

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 25, 2007, 10:00:48 AM4/25/07

to

On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> We can
> definitely help you out with virtually every aspect of its design. Feel free
> to ask any questions you want on this topic... You are indeed in the correct
> newsgroup for this sort of stuff!

Are here any lock-free specialists? If not here, than where? :)

Dmitriy Vyukov

unread,

Apr 25, 2007, 11:37:38 AM4/25/07

to

On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> We can
> definitely help you out with virtually every aspect of its design. Feel free
> to ask any questions you want on this topic... You are indeed in the correct
> newsgroup for this sort of stuff!

Are any lock-free specialists here? If not here, than where? :)

Dmitriy Vyukov

unread,

Apr 25, 2007, 12:07:13 PM4/25/07

to

On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> We can
> definitely help you out with virtually every aspect of its design. Feel free
> to ask any questions you want on this topic... You are indeed in the correct
> newsgroup for this sort of stuff!

Are any lock-free specialists here? If not here, than where? :)

Joe Seigh

unread,

Apr 25, 2007, 10:12:31 PM4/25/07

to

They're here. Not everyone has time to read code though. You
might want to explain the techniques you are using and how they
compare to known techniques or solve known problems.

Chris Thomasson

unread,

Apr 26, 2007, 12:23:51 AM4/26/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1177517233.7...@n35g2000prd.googlegroups.com...

Sorry! Chris has not had time to completely review you code. In order to do
that, I will need to build a prototype of it. This may be worth doing
because I want to verify a potential problem I think I found... Well, it has
to do with a lack of a back-link queue... I am not so sure I can
successfully traverse through a circular shared linked data-structure with
your "current" collector. I believe your semantics will work fine for single
objects... However, traversal through dynamic linked lists, well, I think
there may be a problem: You need to be sure that a proxy collector is not
destroyed unless all previous ones were destroyed; this allows for
traversals... Also, you need to follow FIFO ordering wrt the calling of
destructors of the proxy objects...

Dmitriy Vyukov

unread,

Apr 26, 2007, 10:58:00 AM4/26/07

to

On Apr 26, 8:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> I am not so sure I can
> successfully traverse through a circular shared linked data-structure with
> your "current" collector. I believe your semantics will work fine for single
> objects...

I build it for exactly one object - as a replacement for reader-writer
lock
to some "global" object that changed infrequently and is read often.
Some kind of application settings or some cache etc...
In this situation it gives extremely fast solution, i.e. no locks,
interlocked operations or membars on reader side...

> However, traversal through dynamic linked lists, well, I think
> there may be a problem: You need to be sure that a proxy collector is not
> destroyed unless all previous ones were destroyed; this allows for
> traversals... Also, you need to follow FIFO ordering wrt the calling of
> destructors of the proxy objects...

If you want such semantics, then you must hold handle
store::shared_handle/scoped_handle
to the next node in the previous node. Then you get FIFO order of
destruction and the previous node will not be destroyed before the
next node.

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 26, 2007, 11:09:54 AM4/26/07

to

On Apr 26, 6:12 am, Joe Seigh <jseigh...@xemaps.com> wrote:

> They're here. Not everyone has time to read code though. You
> might want to explain the techniques you are using and how they
> compare to known techniques or solve known problems.

I post algorithm here:
http://groups.google.com/group/comp.programming.threads/tree/browse_frm/thread/6cfe6b1a6926680d/1c8c155c3e052e87?rnum=11&hl=en&_done=%2Fgroup%2Fcomp.programming.threads%2Fbrowse_frm%2Fthread%2F6cfe6b1a6926680d%2F%3Fhl%3Den%26#doc_bbea1d0142adb5c7
I don't think, more words would help to understand the algorithm
better. Especially since you and Chris Thomasson must be familiar with
"outer counter", "inner counter", "counter transfer" etc ;) I have
seen appcore and your Lock-free synchronization primitives ;)

How they compare to known techniques:
I don't seen the solutions that allow read access to shared structure
in wait-free manner and with no membars amortized...

How they solve known problems:
I think that this solution solves near ideal problem of accessing some
global shared data, that changed infrequently and is read frequently -
some kind of application settings or some cache etc.

Dmitriy V'jukov

Dmitriy Vyukov

unread,

Apr 26, 2007, 11:50:38 AM4/26/07

to

On Apr 26, 6:12 am, Joe Seigh <jseigh...@xemaps.com> wrote:

> They're here. Not everyone has time to read code though. You
> might want to explain the techniques you are using and how they
> compare to known techniques or solve known problems.

I post algorithm here:

Joe Seigh

unread,

Apr 26, 2007, 8:20:36 PM4/26/07

to

Generally they're PDR memory management routines needed to make lock-free
algorithms that use PCOW work.

For refcounting it depends on whether you are implementing just plain
threadsafe or atomically threadsafe. If the latter, it's nice if you
point out the technique that avoids the well know race condition which
you might mention so we know that you know it.

Some of the known refcounting techniques, there are two on the atomic_ptr
site, one involves using DCAS or KCSS. The others involve using PDR to
manage the refcounts so you can avoid the race condition. Examples are
rcuref in the Linux kernel, and hazard pointer based refcounting. And
some VZOOM based refcounting I'm sure.

Chris Thomasson

unread,

Apr 27, 2007, 5:34:53 PM4/27/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1177302147.8...@e65g2000hsc.googlegroups.com...

> On 23 апр, 07:29, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
>>
>> news:1177236260.0...@e65g2000hsc.googlegroups.com...>I have the
>> algorithm (and the implementation) of lock-free object
>> > proxy.
>>
>> [...]
>>
>> The code is a little tricky to navigate. Could you please post some brief
>> high-level pseudo-code for the algorithm itself?
>
> I already think about this...
> Ok, I post algorithm in more... traditional form :)
> I don't have a lot of time for this activity, but I try to do this
> today...

[...]

Well, AFAICT the algorithm should work. Humm... IMHO, your counting
technique is "kind of" looking more and more like atomic_ptr... I think I
could fairly easily rig atomic_ptr to use the alignment trick with a bounded
count. Then atomic_ptr would not have to use compare-and-swap. The following
structures would allow for bounded reference counted atomic_ptr without CAS:

struct refcount {
unsigned short ephemeral;
unsigned short reference;
};

template<typename T> struct atomic_ptr_ref {
refcount count;
T *ptr;
};

template<typename T, ptrdiff_t T_RefBits> struct ref {
atomic_ptr_ref<T> *ptr; // alignment trick on T_RefBits boundary.

// This is analogous to:
// struct anchor {
// atomic_ptr_ref<T> *ptr;
// ptrdiff_t T_RefBits;
// };

};

What do ya think Joe? ;^)

Chris Thomasson

unread,

Apr 27, 2007, 5:57:16 PM4/27/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1177599480....@c18g2000prb.googlegroups.com...

> On Apr 26, 8:23 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>> I am not so sure I can
>> successfully traverse through a circular shared linked data-structure
>> with
>> your "current" collector. I believe your semantics will work fine for
>> single
>> objects...

Okay. Let me see if I can orient my terminology with your implementation:

Your proxy object /w private references is the holder class right?

Your proxy object anchor /w master references is the store class right?

:^)

> In this situation it gives extremely fast solution, i.e. no locks,
> interlocked operations or membars on reader side...

Well, you do have to use an atomic operation every single time you acquire
or release a holder from a store. AFAICT, your current collector goes has
follows:

holder = store.acquire();
membar #StoreLoad | #StoreStore;

// read

membar #LoadStore | #StoreStore;
store.release(holder);

So, your not really getting memory barrier free reads. The #StoreLoad
barrier will damage the performance of the following read by blowing away
any thing in the pipelines and cause FSB activity.

Joe Seighs SMR+RCU and *vZOOM can both be configured to acquire, and release
a reference to one of your holder objects without using any interlocked RMW,
or even membars on non-alpha systems...

*--
vZOOM can be based on per-thread/cpu-schedule sync epochs. The polling
thread is basically the proxy object, and the proxy object anchor's are
setup per-thread... So a thread can acquire a reference to a proxy object by
doing a simple increment to a per-thread counter. The thread only has to
actually execute a memory barrier on a periodic or episodic basis...

>> However, traversal through dynamic linked lists, well, I think
>> there may be a problem: You need to be sure that a proxy collector is not
>> destroyed unless all previous ones were destroyed; this allows for
>> traversals... Also, you need to follow FIFO ordering wrt the calling of
>> destructors of the proxy objects...

>
> If you want such semantics, then you must hold handle
> store::shared_handle/scoped_handle
> to the next node in the previous node. Then you get FIFO order of
> destruction and the previous node will not be destroyed before the
> next node.

I am not sure that that's even enough... An iteration through a linked-list
can cross proxy object boundaries... You need to chain proxy objects
together because a traversal of a linked list will span multiple proxy
objects... This is basically a huge race-condition. If you are having
trouble understanding this particular condition, I can post an example of
it...

Come to think of it... I think that Joe Seigh has already posted examples of
why you need to back-link the proxy objects... Humm...

Ahh yes... Here is a quote from Joe:

_________________
" Actually, you have to back link the collector objects so that newer
collector objects
which get created by subsequent node deletions stay alive as long as the
oldest
collector. So it would look something like this

I2 I1
| |
V V
X ==> Q3 <== Q2 <== Q1
| |
v v
n1 n2...

Q3 is the current collector object, X is a reference counted pointer,
atomic_ptr, to
it from the parent object. Q2 and Q1 are older collector objects back
linked
with atomic_ptr and with deleted nodes n1, n2, etc..., queued using normal
ptrs. The
destructors for collector object know to dequeue and delete the nodes. I2
and I1
are local_ptr references from active interators.

If the iterator for I1 finishes and drops, in this case the only, reference
to Q1,
Q1 and n2 etc all get deleted.

If the iterator for I2 finishes before I1, I2 drops its reference but Q2
stays alive
until Q1 gets deleted, so the I1 iterator can safely access n1 if it needs
to.

When an iterator get initialized, it simply gets a reference to the current
collector
object from X.

There's some optimizations in the current prototype not shown here but
that's the general
logic.

Joe Seigh "
____________________

Any thoughts?

Chris Thomasson

unread,

Apr 27, 2007, 8:36:05 PM4/27/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:h_-dneaz77Nsta3b...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1177517233.7...@n35g2000prd.googlegroups.com...
>> On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>>

[...]

>I believe your semantics will work fine for single objects...

[...]

Well, I think I found a potential problem that could cause a memory leak, or
crash...

Please correct me if I am wrong here:

AFAICT, store::acquire only increments the outer counter, and store::release
only decrements the inner counter right?

What if a reader thread does this:

void reader_thread(void* p) {
store& s = *(store*)p;
for (int i = 0; i < STOLEN_BITS + 1000; ++i) {
holder* h = s.acquire();
store::release(h);
}
}

?

Won't the outer counter go out-of-bounds and corrupt the count?

Chris Thomasson

unread,

Apr 27, 2007, 10:00:16 PM4/27/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:zqadnTjmd-EXC6_b...@comcast.com...

> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:h_-dneaz77Nsta3b...@comcast.com...
>> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
>> news:1177517233.7...@n35g2000prd.googlegroups.com...
>>> On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>>>
> [...]
>
>>I believe your semantics will work fine for single objects...
>
> [...]
>
>
> Well, I think I found a potential problem that could cause a memory leak,
> or crash...
>
> Please correct me if I am wrong here:
>
> AFAICT, store::acquire only increments the outer counter, and
> store::release only decrements the inner counter right?

My pc_sample.c collector uses CAS so that it can decrement the outer counter
if the proxy object has not been already been deferred (e.g.,
replace_obj)... Therefore the outer counter (e.g, master reference count)
will not go out-of-bounds unless the user intentionally uses more threads
than stolen bits...

As for you collector, think about this:

max ref/thread/stolen_bit count: 3

static h1.inner{0, 0};
static s1.outer {0, h1};

Thread A
------------------
A1: lh = s1.acquire();
A2: s1.release(lh);
A3: repeat steps A1-A2 9 more times;
A4: done;

Possible Execution Sequence
------------------
E1: A1(s1.outer{1, h1} / h1.inner{0, 0});
E2: A2(s1.outer{1, h1} / h1.inner{0, -1});
E3: A3(...);
E4: A4(s1.outer{*10, h1} / h1.inner{0, -10});

Well, E4 violates the max refcount of 3 with a count of 10, and the user
only created a single thread...

Any thoughts? Is my logic following your algorithm correctly?

Dmitriy Vyukov

unread,

Apr 28, 2007, 9:13:46 AM4/28/07

to

On Apr 28, 4:36 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message
>
> news:h_-dneaz77Nsta3b...@comcast.com...> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

> >news:1177517233.7...@n35g2000prd.googlegroups.com...
> >> On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
> [...]
>
> >I believe your semantics will work fine for single objects...
>
> [...]
>
> Well, I think I found a potential problem that could cause a memory leak, or
> crash...
>
> Please correct me if I am wrong here:
>
> AFAICT, store::acquire only increments the outer counter, and store::release
> only decrements the inner counter right?

Yes

>
> What if a reader thread does this:
>
> void reader_thread(void* p) {
> store& s = *(store*)p;
> for (int i = 0; i < STOLEN_BITS + 1000; ++i) {
> holder* h = s.acquire();
> store::release(h);
> }
>
> }
>
> ?
>
> Won't the outer counter go out-of-bounds and corrupt the count?

No.
Outer counter will be overflowed and inner counter will be overflowed
too.
But when the counter transfer occurs from outer counter to inner
counter, the sum will be correct.
For example, if we have 4 bits stolen. Counter can be from 0 to 15.
And we increment outer counter 17 times -> 17 mod 15 = 2 -> outer
counter will be 2
And inner counter will be -17 mod 15 = -2 -> inner counter will be -2
And when we sum them: 2 + -2 = 0 -> we get the right result
Hence the overflow of counters is the normal thing.
Here is only one restriction - there should not be readers more than
2^STOLEN_BITS. Because then inner counter can overflow _after_ counter
transfer from outer counter.

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 28, 2007, 9:20:43 AM4/28/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1177766026.1...@o5g2000hsb.googlegroups.com...

> On Apr 28, 4:36 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Chris Thomasson" <cris...@comcast.net> wrote in message
>>
>> news:h_-dneaz77Nsta3b...@comcast.com...> "Dmitriy Vyukov"
>> <dvyu...@gmail.com> wrote in message
>> >news:1177517233.7...@n35g2000prd.googlegroups.com...
>> >> On Apr 22, 2:50 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
[...]

>> Won't the outer counter go out-of-bounds and corrupt the count?
>
> No.
> Outer counter will be overflowed and inner counter will be overflowed
> too.

False Alarm! ;^)

> But when the counter transfer occurs from outer counter to inner
> counter, the sum will be correct.

[...]

Ahhh, okay; thank you for that clarification... Well, as of right now, I
can't really find anything wrong with the algorithm itself... However, I
still need to build a prototype of it. Humm... So far so good.

:^)

Dmitriy Vyukov

unread,

Apr 28, 2007, 9:46:24 AM4/28/07

to

On Apr 28, 6:00 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> My pc_sample.c collector uses CAS so that it can decrement the outer counter
> if the proxy object has not been already been deferred (e.g.,
> replace_obj)... Therefore the outer counter (e.g, master reference count)
> will not go out-of-bounds unless the user intentionally uses more threads
> than stolen bits...

Yes. I see this moment in pc_sample.c

> Well, E4 violates the max refcount of 3 with a count of 10, and the user
> only created a single thread...
> Any thoughts? Is my logic following your algorithm correctly?

No, my algorithm quite different.
I use the lsb in inner counter as flag "removed"
Here is the code:
http://groups.google.com/group/comp.programming.threads/tree/browse_frm/thread/6cfe6b1a6926680d/f364c9eef259f49b?rnum=21&hl=en&_done=%2Fgroup%2Fcomp.programming.threads%2Fbrowse_frm%2Fthread%2F6cfe6b1a6926680d%2F%3Fhl%3Den%26#doc_bbea1d0142adb5c7

The outer and inner counters can overflow many times. And this is the
normal operation mode.
When the object is deferred, then outer counter is added to inner
counter and flag "removed" in inner counter is set. No overflow must
be in inner counter only _after_ the "removed" flag is set - and this
is the point
And when the inner counter goes to zero _and_ removed flag is set, the
object is deleted

This is the difference, and because of this I can implement the
algorithm only with atomic_add, I don't need CAS at all!

Dmitriy V'jukov

Chris Thomasson

unread,

Apr 28, 2007, 10:07:29 AM4/28/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1177767984.2...@u30g2000hsc.googlegroups.com...

> On Apr 28, 6:00 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> My pc_sample.c collector uses CAS so that it can decrement the outer
>> counter
>> if the proxy object has not been already been deferred (e.g.,
>> replace_obj)... Therefore the outer counter (e.g, master reference count)
>> will not go out-of-bounds unless the user intentionally uses more threads
>> than stolen bits...
>
> Yes. I see this moment in pc_sample.c
>
>> Well, E4 violates the max refcount of 3 with a count of 10, and the user
>> only created a single thread...
>> Any thoughts? Is my logic following your algorithm correctly?
>
> No, my algorithm quite different.
> I use the lsb in inner counter as flag "removed"
> Here is the code:

I understand.

Dmitriy Vyukov

unread,

Apr 30, 2007, 7:05:16 AM4/30/07

to

On 28 апр, 01:57, "Chris Thomasson" <cris...@comcast.net> wrote:

> Okay. Let me see if I can orient my terminology with your implementation:
>
> Your proxy object /w private references is the holder class right?

yes

> Your proxy object anchor /w master references is the store class right?

yes

> :^)

;)

> > In this situation it gives extremely fast solution, i.e. no locks,
> > interlocked operations or membars on reader side...
>
> Well, you do have to use an atomic operation every single time you acquire
> or release a holder from a store. AFAICT, your current collector goes has
> follows:
>
> holder = store.acquire();
> membar #StoreLoad | #StoreStore;
>
> // read
>
> membar #LoadStore | #StoreStore;
> store.release(holder);
>
> So, your not really getting memory barrier free reads.

Yes, you're right.
But I can cache the object in thread-specific-storage (actually on
stack)

> Joe Seighs SMR+RCU and *vZOOM can both be configured to acquire, and release
> a reference to one of your holder objects without using any interlocked RMW,
> or even membars on non-alpha systems...
>
> *--
> vZOOM can be based on per-thread/cpu-schedule sync epochs. The polling
> thread is basically the proxy object, and the proxy object anchor's are
> setup per-thread... So a thread can acquire a reference to a proxy object by
> doing a simple increment to a per-thread counter. The thread only has to
> actually execute a memory barrier on a periodic or episodic basis...

It looks like my scheme is very similar to your vZOOM...
See details here:
http://groups.google.com/group/comp.programming.threads/tree/browse_frm/thread/6cfe6b1a6926680d/2eb0ff35d50e696e?rnum=21&hl=ru&_done=%2Fgroup%2Fcomp.programming.threads%2Fbrowse_frm%2Fthread%2F6cfe6b1a6926680d%2Fefe0fa93086267f8%3Ftvc%3D1%26hl%3Dru%26#doc_91d01886ae690185
in particular - struct local_mediator and the function "holder*
acquire(local_mediator* m)"

But seems, my algorithm a little different to vZOOM.
I don't need "per-thread/cpu-schedule sync epochs" and I don't need to

"The thread only has to actually execute a memory barrier on a
periodic or episodic basis".

In my scheme user sees the new object when he executes first acquire()
since the new object is installed in store. Here are no "epochs".
And second, user have to execute membars _only_ when the new object is
installed in the store. Here are no "periodic or episodic basis".

Here the piece of code for "cached acquire":

// Here:
// local_mediator - thread local cache
// local_mediator::counter - thread local acquire counter
// local_mediator::cache - thread local cache (points to holder
object)
holder* acquire(local_mediator* m)
{
// Check whether this thread already holds some object
// We don't allow to the thread to acquire 2 different
// versions of the same object

if (m->counter++)
return m->cache;

// Check whether there are some object in the cache
if (m->cache)
{
// Make ordinary load, _not_ atomic with membars

holder* h = (holder*)(outer.ptr << STOLEN_BITS);

// Check whether we hold the actual object

if (m->cache == h)
return m->cache;

// We hold not the actual object - release the old object
// and acquire the actual

release(m->cache);
}
m->cache = acquire();
return m->cache;
}

BTW where are the code for this "magical" vZOOM? :)
I don't see the code on http://home.comcast.net/~vzoom/
Or you don't open the code yet? It is concerned with patents? :)

> I am not sure that that's even enough... An iteration through a linked-list
> can cross proxy object boundaries... You need to chain proxy objects
> together because a traversal of a linked list will span multiple proxy
> objects... This is basically a huge race-condition. If you are having
> trouble understanding this particular condition, I can post an example of
> it...
>
> Come to think of it... I think that Joe Seigh has already posted examples of
> why you need to back-link the proxy objects... Humm...
>
> Ahh yes... Here is a quote from Joe:

> Any thoughts?

I don't understand one moment. Here:

> If the iterator for I2 finishes before I1, I2 drops its reference but Q2
> stays alive
> until Q1 gets deleted, so the I1 iterator can safely access n1 if it needs
> to.

If the thread already holds reference to some version of the object,
why he may want to access the other version of the same object?
I threat such situation as wrong.
I espesially preserve "transactional" semantics - if the thread sees
some version of the object, when this thread _must_not_ see any
other version of the same object before "release" of the first object.
In the terms of Joe example:
If I1points to n2, then I1 must even don't have the opportunity to see
the n1...
Else, in the terms of databases, there are don't preserve atomic
transactional semantics...
...may be here play some role moment, that we design solutions for
different use cases... I design to _one_ object, not the some set of
objects...

But again, if you want to get fifo destruction semantics with my
solution, than older objects must hold reference to newer objects.
In the terms of Joe example:
Q1 must hold reference to Q2, and Q2 must hold reference to Q1.
I think it must work...

Dmitriy V'jukov

Dmitriy Vyukov

unread,

May 2, 2007, 2:03:43 AM5/2/07

to

On 28 апр, 01:34, "Chris Thomasson" <cris...@comcast.net> wrote:

> Well, AFAICT the algorithm should work. Humm... IMHO, your counting
> technique is "kind of" looking more and more like atomic_ptr... I think I
> could fairly easily rig atomic_ptr to use the alignment trick with a bounded

> count. Then atomic_ptr would not have to use compare-and-swap. > What do ya think Joe? ;^)

I think that you can use such scheme in pc_sample and Joe Seigh can
use in atomic_ptr. It sould be applicable...
It is difficult to say how much it is better, but it definitely not
worser. I mean atomic_add vs. cas loop.

Dmitriy V'jukov

Dmitriy Vyukov

unread,

May 2, 2007, 8:40:33 AM5/2/07

to

On Apr 30, 3:05 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> But seems, my algorithm a little different to vZOOM.
> I don't need "per-thread/cpu-schedule sync epochs" and I don't need to
> "The thread only has to actually execute a memory barrier on a
> periodic or episodic basis".
>
> In my scheme user sees the new object when he executes first acquire()
> since the new object is installed in store. Here are no "epochs".
> And second, user have to execute membars _only_ when the new object is
> installed in the store. Here are no "periodic or episodic basis".

Chris, why you need "epochs" or "periods" to update reader local
value?
Why not just read (not atomic read, just plain load) the actual global
value and compare with reader local value, and issue all those atomic
operations only when those values differs?

Dmitriy V'jukov

Chris Thomasson

unread,

May 2, 2007, 9:13:23 AM5/2/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178109633.1...@o5g2000hsb.googlegroups.com...

> On Apr 30, 3:05 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
>> But seems, my algorithm a little different to vZOOM.
>> I don't need "per-thread/cpu-schedule sync epochs" and I don't need to
>> "The thread only has to actually execute a memory barrier on a
>> periodic or episodic basis".
>>
>> In my scheme user sees the new object when he executes first acquire()
>> since the new object is installed in store. Here are no "epochs".
>> And second, user have to execute membars _only_ when the new object is
>> installed in the store. Here are no "periodic or episodic basis".
>
> Chris, why you need "epochs" or "periods" to update reader local
> value?

I can *configure the pdr portion of the library so it can handle all of the
epoch and period detection behind the scenes in a way that relives reader,
or writer threads from accessing global variables. This usually comes in
real handy on NUMA, or other similar architecture types simply because its
"naturally" scaleable. It can reliably deploy on highly distributed computer
systems. If your making frequent access to global variables, well, that can
limit your algorithms portability to such environments... BTW, NUMA
programming tricks and techniques can be beneficial in more compact
multi-core based systems as well...

The epoch polling entities can be scaleable as well. You can partition them
per-group of nodes in a distributed system.

> Why not just read (not atomic read, just plain load) the actual global
> value and compare with reader local value, and issue all those atomic
> operations only when those values differs?

I wanted to try to eliminate the need for global variable access... The pdr
anchors are per-thread in vZOOM for performance and portability to heavily
distributed systems with weak cache-coherency. I guess I was just covering
my bases so that I could say yes to most of the questions that go like: "can
vzoom run on the arm9", "can it run on arch x that is private to our
company..." I can say, well if you provide a spinlock, or mutex, or POSIX,
it can run.

--
*

It is possible to get high-performance PDR in a POSIX environment. Since the
automatic detection thing is highly non-portable, a user needs to follow a
usage pattern. I can use POSIX APIS and a single atomic load-dependant
function to design something that can beat its own read/write mutexs.

I had to give the library the ability to run on all sorts of platforms for
which I have no customized assembly language support library written for. I
figured I had to create a robust version of my library that relied on at
least 99% POSIX API's and still have the ability to out perform POSIX
read/write mutexs. I can get it up and running by only porting a single
assembly language function in the form of a simple load.

Dmitriy Vyukov

unread,

May 3, 2007, 11:05:30 AM5/3/07

to

On May 2, 5:13 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> > Chris, why you need "epochs" or "periods" to update reader local
> > value?
>

> I can *configure the pdr portion of the library so it can handle...

Oh. I see. I just minimize "CAS count" and you go much further...
I don't even think about minimizing global data access...

Dmitriy V'jukov

Chris Thomasson

unread,

May 3, 2007, 5:53:25 PM5/3/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:wd-dnZlxhstpEKXb...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1178109633.1...@o5g2000hsb.googlegroups.com...

[...]

> It is possible to get high-performance PDR in a POSIX environment. Since
> the automatic detection thing is highly non-portable, a user needs to

> follow a usage pattern. [...]

To clarify, the POSIX-Based PDR does not have automatic epoch detection. A
user thread must make use of a usage pattern in order go make efficient use
of the PDR scheme.

Chris Thomasson

unread,

May 4, 2007, 3:38:29 AM5/4/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178204730.4...@l77g2000hsb.googlegroups.com...

The memory allocator I invented also does not need to access any global
data-structures... The actual algorithm is nice and neat.

Dmitriy Vyukov

unread,

May 4, 2007, 4:05:10 AM5/4/07

to

On May 4, 11:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> The memory allocator I invented also does not need to access any global
> data-structures... The actual algorithm is nice and neat.

Now I take one more look at:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82

Yes, it is nice and neat :)
I discern - other threads push free blocks to shared_anchor of the
thread allocator and then, when thread local blocks is over, thread
get all blocks from shared_anchor at once - pretty nice :)

But stop. In some usage patters - for example one thread enqueue nodes
to shared queue and other threads dequeue nodes from queue and free
nodes - all consumer threads will heavily compete for shared_anchor of
first thread allocator...

Dmitriy V'jukov

Chris Thomasson

unread,

May 5, 2007, 8:33:40 PM5/5/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178265910.8...@e65g2000hsc.googlegroups.com...

> On May 4, 11:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> The memory allocator I invented also does not need to access any global
>> data-structures... The actual algorithm is nice and neat.
>
> Now I take one more look at:
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
>
> Yes, it is nice and neat :)
> I discern - other threads push free blocks to shared_anchor of the
> thread allocator and then, when thread local blocks is over, thread
> get all blocks from shared_anchor at once - pretty nice :)

Yup. A brief description in section 1.1:

http://appcore.home.comcast.net/vzoom/round-2.pdf

If a thread tries to free a block that it did not allocate, then an
interlocked RMW instruction will be used. Otherwise, plain old loads and
stores will be used.

> But stop. In some usage patters - for example one thread enqueue nodes
> to shared queue and other threads dequeue nodes from queue and free
> nodes - all consumer threads will heavily compete for shared_anchor of
> first thread allocator...

Yes your correct. Well, at least it all lock-free...

:^)

Anyway, IMHO, its virtually impossible to get a general purpose
multi-threaded memory allocator that can address 100% of the usage patterns
without falling back on a least an interlocked RMW instruction...

Dmitriy Vyukov

unread,

May 6, 2007, 2:40:09 AM5/6/07

to

On 6 май, 04:33, "Chris Thomasson" <cris...@comcast.net> wrote:

> > But stop. In some usage patters - for example one thread enqueue nodes
> > to shared queue and other threads dequeue nodes from queue and free
> > nodes - all consumer threads will heavily compete for shared_anchor of
> > first thread allocator...
>
> Yes your correct. Well, at least it all lock-free...
>
> :^)

I think of fifo-queue scheme, that can help in such situation (scheme
applicable to spsc or spmc pattern):

NODE -> NODE -> NODE -> NODE -> NODE -> NODE -> NODE
/\ /
\ /\
|
| |
TAIL CONSUME_POS
PRODUCE_POS

(I hope pseudo-graphics turn out well...)

Here consumers don't actually free/recycle nodes - they just move the
CONSUME_POS. And producer himself, when he need new nodes, get nodes
from TAIL.
So NODES are acquired/recycled by one thread - producer - so we
eliminate one CAS operation...

Any thoughts? ;)

Dmitriy V'jukov

Dmitriy Vyukov

unread,

May 6, 2007, 2:44:00 AM5/6/07

to

On 6 май, 10:40, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> NODE -> NODE -> NODE -> NODE -> NODE -> NODE -> NODE
> /\ /
> \ /\
> |
> | |
> TAIL CONSUME_POS
> PRODUCE_POS
>
> (I hope pseudo-graphics turn out well...)

It seems pseudo-graphics is broken...
Try #2:

NODE -> NODE -> NODE -> NODE -> NODE
/\ /\ /\
| | |
TAIL CONSUME_POS PRODUCE_POS

Dmitriy V'jukov

Chris Thomasson

unread,

May 6, 2007, 3:12:57 AM5/6/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178433609.1...@p77g2000hsh.googlegroups.com...

I already mention this in a thread titled something like "is lock-free
usefull?" or something... You can use multiplexed per-thread spsc queues for
distributed per-thread memory cache. I can't seem to find it on google...
For some reason, it seems like it lost some posts... I can't find some of
the older stuff I posted at the moment... Strange.

For instance, I could normally find the code for my eventcount algorithm my
searching the group with google using the following phrase:

"really simple portable eventcount"

Now, nothing shows up anymore! Crap... I think google lost some posts...
Humm...

My algorithm breaks it down to a multi-producer and single-consumer queue.
That's what you need. Of course you could implement the algorithm with
multiplex per-thread spsc queue to get the same effect. The latter would
probably do better in distributed systems.

Chris Thomasson

unread,

May 6, 2007, 3:20:17 AM5/6/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:3pOdndYSV_JHvKDb...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1178265910.8...@e65g2000hsc.googlegroups.com...
>> On May 4, 11:38 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>>

[...]

>
> Anyway, IMHO, its virtually impossible to get a general purpose
> multi-threaded memory allocator that can address 100% of the usage
> patterns without falling back on a least an interlocked RMW instruction...

I guess it may be impossible in a sense... Well, think "general purpose"
malloc/free implementation like vZOOM will try to fast-path the allocation
through the algorithm, and fall back on system-calls to allocate memory for
anything that the fast-path could not handle. The system calls will most
likely be using a clever locking strategy or may be lock-free themselves...
Who knows...

Chris Thomasson

unread,

May 6, 2007, 3:26:19 AM5/6/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:qISdnRP_zY7t4qDb...@comcast.com...

[...]

the following refers to the allocator, not the eventcount:

> My algorithm breaks it down to a multi-producer and single-consumer queue.
> That's what you need. Of course you could implement the algorithm with
> multiplex per-thread spsc queue to get the same effect. The latter would
> probably do better in distributed systems.

I remember posting this idea in the context of lock-free scaleable message
passing...

I can't find it in google! WTF...

Dmitriy Vyukov

unread,

May 6, 2007, 3:44:30 AM5/6/07

to

On 6 май, 10:44, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

Try #3:

Chris Thomasson

unread,

May 6, 2007, 11:59:47 PM5/6/07

to

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:qISdnRP_zY7t4qDb...@comcast.com...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1178433609.1...@p77g2000hsh.googlegroups.com...

[...]

> For instance, I could normally find the code for my eventcount algorithm
> my searching the group with google using the following phrase:
>
> "really simple portable eventcount"

Strange how this search turned up no results for a couple of days, then all
of a sudden it works... Seems like I searched during some sort of
maintenance anomaly something..

Dmitriy Vyukov

unread,

May 8, 2007, 11:18:16 AM5/8/07

to

On May 6, 11:12 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> I already mention this in a thread titled something like "is lock-free
> usefull?" or something... You can use multiplexed per-thread spsc queues for
> distributed per-thread memory cache.

It seems that you say about different thing - I don't propose to
multiplex queues...

> I can't seem to find it on google...
> For some reason, it seems like it lost some posts... I can't find some of
> the older stuff I posted at the moment... Strange.

Yes, I notice some message lost too...

> My algorithm breaks it down to a multi-producer and single-consumer queue.
> That's what you need. Of course you could implement the algorithm with
> multiplex per-thread spsc queue to get the same effect. The latter would
> probably do better in distributed systems.

You say about different thing. You say about mpsc, and I say about
spmc...
I post my idea in dedicated thread...

Dmitriy V'jukov

Dmitriy Vyukov

unread,

May 8, 2007, 11:22:24 AM5/8/07

to

On May 6, 11:26 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> I remember posting this idea in the context of lock-free scaleable message
> passing...

Are you mean this?
http://groups.google.com/group/comp.programming.threads/msg/1d636b02dd724f0f?hl=en&

I say about different thing...

Dmitriy V'jukov

Chris Thomasson

unread,

May 8, 2007, 7:05:05 PM5/8/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178637496.7...@w5g2000hsg.googlegroups.com...

> On May 6, 11:12 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> I already mention this in a thread titled something like "is lock-free
>> usefull?" or something... You can use multiplexed per-thread spsc queues
>> for
>> distributed per-thread memory cache.
>
> It seems that you say about different thing - I don't propose to
> multiplex queues...

Yup. Its different.

>
>> I can't seem to find it on google...
>> For some reason, it seems like it lost some posts... I can't find some of
>> the older stuff I posted at the moment... Strange.
>
> Yes, I notice some message lost too...

Strange.

[...]

> You say about different thing. You say about mpsc, and I say about
> spmc...
> I post my idea in dedicated thread...

That should make things easier to read indeed!

;^)

Chris Thomasson

unread,

May 8, 2007, 7:06:02 PM5/8/07

to

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message

news:1178637744....@n59g2000hsh.googlegroups.com...

> On May 6, 11:26 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> I remember posting this idea in the context of lock-free scaleable
>> message
>> passing...
>
> Are you mean this?
> http://groups.google.com/group/comp.programming.threads/msg/1d636b02dd724f0f?hl=en&

Ahh yeah. My post is in that thread for sure.

> I say about different thing...

Agreed.

Chris Thomasson

unread,

May 9, 2007, 4:56:58 PM5/9/07

to

For the moment, I can't find any problems with the algorithm itself.

Everything looks okay to me.