Advantages:
+ Able to sustain very high read rate and high write rate
+ Read access is always lock-free, mostly wait-free (only very
retarded readers aborted and retry) and basically costless
+ Write access is lock-free and requires single atomic RMW
+ Does not require any form of "safe memory reclamation"
+ Bounded memory consumption (memory consumption is (number_of_writers
+ number_of_mutexes)*sizeof(protected_object))
+ Possibility of read-to-write upgrade
Disadvantages:
- Since read side is based on SeqLock, all data must be completely
embed into protected object (i.e. no pointers to external objects)
- Writers have to retry the operation on contention
- Requires double-word atomic operations
Where it may be applied? I think it may be applied in parallel branch-
and-bound computations, where worker threads have to read current best
solution on each iteration as well as constantly update it. Another
possible application may be in an implementation of a TCP/IP stack to
manage connection's state; all connection related operations (user
calls via socket API, hardware interrupt handlers, internal kernel
threads) have to read connection's state, and some of them have to
update some fields in connection's state. General pattern is: very
high read rate + high write rate.
Here is brief overview of the design. Reader side is based on Improved
SeqLock:
{URL}
The innovation is on the writer side, how writers manage memory and
how they interact with readers.
The design consists of 2 components - mutex itself and object cache
(which is shared between all the mutexes that protect objects of the
same type). Writer operates as follows. First he acquires new object
from the cache, then it copies data from the current object and
applies requires modifications (plain COW), after that he tries to
commit. If commit fails he restarts the operation. During the
operation, writer manages sequence numbers (required for / read-side)
accordingly.
Regarding object cache. Mutex contains current version of user object,
also each writer thread caches 1user object in thread-local storage.
When a thread starts write transaction he gets an object from thread-
local storage and applies modifications to it, then it tries to swap
the object with current object (contained within mutex), if swap
succeeds thread puts previously current object to thread-local cache.
That way total number of objects never exceeds (number_of_writers + 1)
(or if you have several mutexes - (number_of_writers +
number_of_mutexes)).
Here is read pattern (the same as you have with SeqLock):
// user data
struct volume_t
{
//...
};
// it's the mutex
rw_store<volume_t> store;
// read transaction
volume_t v;
rw_store<volume_t>::state_t st = 0;
do
{
volume_t const& v0 = store.read_begin(st);
v = v0;
}
while (!store.read_commit(st));
// at this point reader has consistent copy of the data
And here is write pattern (mix of COW, SeqLock and STM):
rw_store<volume_t>::state_t rst = 0;
rw_store<volume_t>::state_t wst = 0;
for (;;)
{
// threads starts as a reader
volume_t const& v0 = store.read_begin(rst);
// v0 points to 'current object'
// then he decides as to whether he wants to do a modification of
not
if ((v0.x + v0.y + v0.z) % 9 != iter % 9)
{
// upgrade to writer
volume_t& v2 = store.write_begin(rst, wst);
// v2 points to object to which we must apply modifications
// apply COW modification
v2.x = v0.x + inc;
v2.y = v0.y + inc;
v2.z = v0.z + inc;
v2.dx = v0.dx + inc;
v2.dy = v0.dy + inc;
v2.dz = v0.dz + inc;
// let's try to commit write
if (store.write_commit(rst, wst))
break;
}
else
{
// do not want to do a modification
// let's try to commit as a reader
if (store.read_commit(rst))
break;
}
}
// flush internal cache
// required if a thread tried to do a write,
// but finally committed as a reader
store.write_reset(wst);
Ok, here is the implementation (just some 20 lines of code):
template<typename T>
struct node
{
uintptr_t counter_;
node* next_;
T data_;
char pad [CACHE_LINE_SIZE];
};
template<typename T>
class rw_store
{
public:
rw_store()
{
// setup 'current' object
anchor_.node_ = node_cache<T>::alloc();
anchor_.counter_ = anchor_.node_->counter_;
}
~rw_store()
{
// release 'current' object to cache
node_cache<T>::free(anchor_.node_);
}
typedef dword_t state_t;
// returns 'current' object
T const& read_begin(state_t& st)
{
// just do atomic double-word load of current object
// along with sequence number
// note that here is no blocking as in classical SeqLock
// current version is always consistent and available for
reading
st = atomic_dwload_acq_read((dword_t*)&anchor_);
anchor& anc = *(anchor*)&st;
return anc.node_->data_;
}
bool read_commit(state_t st)
{
// load and verify object version
anchor& anc = *(anchor*)&st;
word_t cmp = atomic_load_rel_read((word_t*)&anc.node_-
>counter_);
return anc.counter_ == cmp;
}
// returns pointer to an object to which a writer
// must apply modifications
T& write_begin(state_t rst, state_t& wst)
{
anchor& wanc = *(anchor*)&wst;
// get new object from thread-local cache
// (if the thread already tried to do a write in this
transaction,
// then he already has a new object, so nothing to do)
if (wanc.node_ == 0)
{
wanc.node_ = node_cache<T>::get();
// increment object's sequence number
// only after the increment retarded readers
// that still try to reader this object will be aborted
atomic_store_acq_write((uint32_t*)&wanc.node_->counter_,
wanc.node_->counter_ + 1);
wanc.counter_ = wanc.node_->counter_;
}
return wanc.node_->data_;
}
bool write_commit(state_t rst, state_t& wst)
{
// try to commit write by swapping current version with our
new
if (!atomic_dwcas_rel((dword_t*)&anchor_, &rst, wst))
return false;
// object contained within 'wst' will be put to
// thread-local cache in write_reset()
wst = rst;
return true;
}
void write_reset(state_t wst)
{
// just put object to thread-local cache
anchor& anc = *(anchor*)&wst;
if (anc.node_)
node_cache<T>::set(anc.node_);
}
private:
struct anchor
{
uintptr_t counter_;
node<T>* node_;
};
// object must be dword aligned
anchor anchor_;
char pad [CACHE_LINE_SIZE];
};
And here is object cache (no rocket since here, it just holds per-
thread object):
template<typename T>
class node_cache
{
public:
static node<T>* get()
{
node<T>* n = (node<T>*)pthread_getspecific(instance.slot_);
if (n)
return n;
return alloc();
}
static void set(node<T>* n)
{
pthread_setspecific(instance.slot_, n);
}
static node<T>* alloc()
{
pthread_mutex_lock(&instance.mtx_);
node<T>* n = instance.head_;
if (n)
instance.head_ = n->next_;
pthread_mutex_unlock(&instance.mtx_);
if (n)
return n;
n = new node<T>;
n->counter_ = 0;
n->next_ = 0;
return n;
}
static void free(node<T>* n)
{
pthread_mutex_lock(&instance.mtx_);
n->next_ = instance.head_;
instance.head_ = n;
pthread_mutex_unlock(&instance.mtx_);
}
private:
pthread_mutex_t mtx_;
pthread_key_t slot_;
node<T>* head_;
static node_cache instance;
node_cache()
{
pthread_mutex_init(&mtx_, 0);
pthread_key_create(&slot_, (void(*)(void*))
&node_cache<T>::free);
head_ = 0;
}
};
template<typename T>
node_cache<T> node_cache<T>::instance;
That's it.
In the next post I will provide results of a simple synthetic
benchmark against SeqLock, my Improvied SeqLock, pthread_mutex_t,
pthread_spinlock_t, pthread_swrlock_t and CRITICAL_SECTION.
–
Dmitriy V'jukov
Benchmark description: main thread creates N worker threads, each
worker thread executes M iterations. On each iteration worker executes
read transaction, then does verification of read data, then makes a
bit of local work, every K-th iteration it "decides" to do a write
transaction. In order to do a write transaction, he starts as a reader
first, then based on current object state he decides as to whether he
actually need to do a write or not; if he decides to do an actual
write he upgrades to writer, updates data structure and tries to
commit, otherwise he tries to commit as a reader.
Ah, Ok, you better just read source code:
// user object
struct volume_t
{
int x, y, z;
int dx, dy, dz;
volume_t()
{
x = y = z = dx = dy = dz = 0;
}
};
extern "C"
void* thread_func(void* p)
{
rw_store<volume_t>& store = *(rw_store<volume_t>*)p;
int volatile sum = 0;
for (long iter = 0; iter != iter_count; iter += 1)
{
volume_t v;
rw_store<volume_t>::state_t st = 0;
do
{
volume_t const& v0 = store.read_begin(st);
v = v0;
}
while (!store.read_commit(st));
sum += v.x*v.x + v.y*v.y + v.z*v.z + v.dx*v.dx + v.dy*v.dy +
v.dz*v.dz;
if (v.x != v.y || v.x != v.z || v.x != v.dx || v.x != v.dy ||
v.x != v.dz)
exit(printf("INCONSISTENT DATA\n"));
if ((iter % write_ratio) == 0)
{
int inc = iter % 3 + 1;
rw_store<volume_t>::state_t rst = 0;
rw_store<volume_t>::state_t wst = 0;
for (;;)
{
volume_t const& v0 = store.read_begin(rst);
v = v0;
if ((v0.x + v0.y + v0.z) % 9 != iter % 9)
{
volume_t& v2 = store.write_begin(rst, wst);
v2.x = v0.x + inc;
v2.y = v0.y + inc;
v2.z = v0.z + inc;
v2.dx = v0.dx + inc;
v2.dy = v0.dy + inc;
v2.dz = v0.dz + inc;
if (store.write_commit(rst, wst))
break;
}
else
{
if (store.read_commit(rst))
break;
}
}
store.write_reset(wst);
if (v.x != v.y || v.x != v.z || v.x != v.dx || v.x != v.dy
|| v.x != v.dz)
exit(printf("INCONSISTENT DATA\n"));
}
}
return (void*)sum;
}
For other synchronization primitives I implement similar functions.
Here are results of the benchmark for Windows on Intel 2 processor/8
core/16 hw threads Xeon system. Number of iteration (M) =1000000,
write rate (K) = 10. Time is in seconds. Here is a graph against
SeqLock, Improved SeqLock and CRITICAL_SECTION:
http://picasaweb.google.com/dvyukov/LockFree#5431125797728327778
And here is graph with w\o CRITICAL_SECTION:
http://picasaweb.google.com/dvyukov/LockFree#5431125800615069186
Here are results of the benchmark for Solaris on Sun 8 core/64 hw
threads UltraSPARC T2 system. All parameters are the same. Here is a
graph against SeqLock, Improved SeqLock, pthread_mutex_t,
pthread_spinlock_t and pthread_rwlock_t:
http://picasaweb.google.com/dvyukov/LockFree#5431125803404587058
And here is graph w\o pthread primitives:
http://picasaweb.google.com/dvyukov/LockFree#5431125802261932994
As you may see, the primitive substantially improves on SeqLock
regarding write rate.
–
Dmitriy V'jukov
Sorry, but what about link ? :)
Damn!
Here it is:
http://blogs.sun.com/dave/resource/Asymmetric-Dekker-Synchronization.txt
And here is my silly explanation of asymmetric synchronization:
http://groups.google.com/group/lock-free/msg/5372151199fb7dde
(Yes, I am still going to write 'an article' on this... I may be in
this state indefinitely long, though :) )
--
Dmitriy V'jukov
Damn^2! Sorry for that nonsense. I'm currently editing next post on
asymmetric countdown latch, and the links I provided are intended for
{URL} placeholder in *that* article, not in *this* article.
Here is the link for Improved SeqLock:
http://groups.google.com/group/lock-free/browse_frm/thread/5705265ef7a1a1a3
--
Dmitriy V'jukov
Damn^2! Sorry for that nonsense. I'm currently editing next post on
asymmetric countdown latch, and the links I provided are intended for
{URL} placeholder in *that* article, not in *this* article.
Here is the link for Improved SeqLock:
http://groups.google.com/group/lock-free/browse_frm/thread/5705265ef7a1a1a3