Scalable hash map: The defeat of lock-based synchronization

Dmitriy V'jukov

unread,

Sep 12, 2008, 3:47:42 PM9/12/08

to

I've implemented a scalable concurrent hash map, which uses a bunch of
advanced synchronization techniques, like timestamp based validation,
deferred memory reclamation and cache conscious data layout. Timestamp
based validation is used for validation of read transactions. Deferred
memory reclamation uses a kind of amortized proxy-collector technique,
and together with timestamp based validation provides pure read-only
read transactions, i.e. read transactions don't modify any shared
state. Also read transactions are totally lock-free, i.e. they can
progress concurrently with write transactions and even with concurrent
resize operation. Write transactions use extremely fine-grained
locking, and usually touch/modify only one cache-line.
As a result this implementation beats Intel Threading Building
Blocks's concurrent_hash_map by a factor of 50 on read-mostly workload
on Intel Core 2 Quad (Q6600), and by a factor of 20 on modest write
workload. The cost of read transaction (find operation) is about 30
cycles, which is basically equal to that of single-threaded hash map.
Algorithm also have perfect linear scalability on read-mostly workload
so on greater number of cores/processors it will have even higher
performance difference with traditional lock-based synchronization (on
8 cores I expect >100x).

Here are benchmark results on Q6600.
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/op
Cores 0&1: 436 cycles/op (scaling 1.55)
Cores 0&2: 1080 cycles/op (scaling 0.63)
All cores: 1412 cycles/op (scaling 0.96)

My hash map:
Single core: 30 cycles/op
Cores 0&1: 30 cycles/op (scaling 2)
Cores 0&2: 30 cycles/op (scaling 2)
All cores: 28 cycles/op (scaling 4)

Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/op
Cores 0&1: 424 cycles/op (scaling 1.80)
Cores 0&2: 1168 cycles/op (scaling 0.66)
All cores: 1452 cycles/op (scaling 1.06)

My hash map:
Single core: 42 cycles/op
Cores 0&1: 42 cycles/op (scaling 2)
Cores 0&2: 42 cycles/op (scaling 2)
All cores: 40 cycles/op (scaling 4)

Element count = 256, mixed workload (4 threads make 100% finds; 4
threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 348 cycles/op
Cores 0&1: 424 cycles/op (scaling 1.64)
Cores 0&2: 954 cycles/op (scaling 0.73)
All cores: 1216 cycles/op (scaling 1.14)

My hash map:
Single core: 51 cycles/op
Cores 0&1: 54 cycles/op (scaling 1.89)
Cores 0&2: 62 cycles/op (scaling 1.65)
All cores: 60 cycles/op (scaling 3.4)

Element count = 65536, mixed workload (4 threads make 100% finds; 4
threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 377 cycles/op
Cores 0&1: 412 cycles/op (scaling 1.83)
Cores 0&2: 1032 cycles/op (scaling 0.73)
All cores: 1288 cycles/op (scaling 1.17)

My hash map:
Single core: 74 cycles/op
Cores 0&1: 70 cycles/op (scaling 2.12)
Cores 0&2: 86 cycles/op (scaling 1.72)
All cores: 88 cycles/op (scaling 3.36)

I will post implementation in following post.

Dmitriy V'jukov
--
Relacy Race Detector: Make your synchronization correct!
http://groups.google.ru/group/relacy/web

Dmitriy V'jukov

unread,

Sep 12, 2008, 3:51:57 PM9/12/08

to

On Sep 12, 11:47 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I will post implementation in following post.

Implementation is NOT PRODUCTION READY, it's just an illustration of
the algorithm. Implementation can be compiled only with MSVC/Win32,
but algorithm is fairly portable, only single-word atomic RMW
operations are required, and very little of OS specific stuff.

/*********************** hash_map.h ***********************/

#pragma once
#include "pcx.h"
#include <new>
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x500
#endif
#include <windows.h>
#include <malloc.h>
#include <intrin.h>
#pragma intrinsic (_InterlockedExchange)

namespace rl
{

template<typename key_t, typename value_t>
struct concurrent_hash_map_traits
{
typedef key_t key_in_t;
typedef key_t& key_out_t;
typedef value_t value_in_t;
typedef value_t& value_out_t;

static size_t hash(key_in_t k)
{
return static_cast<size_t>(k);
}

static bool compare_key(key_in_t k1, key_in_t k2)
{
return k1 == k2;
}

static void copy_key(key_out_t k1, key_in_t k2)
{
k1 = k2;
}

static void copy_value(value_out_t v1, value_in_t v2)
{
v1 = v2;
}

static void atomic_copy_value(value_out_t v1, value_in_t v2)
{
v1 = v2;
}

static void construct_key(void* k)
{
new (k) key_t ();
}

static void construct_value(void* v)
{
new (v) value_t ();
}

static void destroy_key(key_t* k)
{
}

static void destroy_value(value_t* v)
{
}
};

template<typename type>
struct is_pod
{
static bool const result = __is_pod(type);
};

template<> struct is_pod<int>
{static bool const result = true;};
template<> struct is_pod<unsigned int>
{static bool const result = true;};
template<> struct is_pod<long>
{static bool const result = true;};
template<> struct is_pod<unsigned long>
{static bool const result = true;};
template<> struct is_pod<__int64>
{static bool const result = true;};
template<> struct is_pod<unsigned __int64>
{static bool const result = true;};

template<typename key_t, typename value_t,
size_t key_size, bool key_pod, size_t value_size, bool value_pod>
struct concurrent_hash_map_layout;

template<typename key_t, typename value_t>
struct concurrent_hash_map_layout<key_t, value_t, 4, true, 4, true>
{
static size_t const initial_size = 256;
static size_t const cell_to_add_cell_ratio = 128;
static size_t const cell_item_count = 3;
static size_t const add_cell_item_count = 10;

static unsigned const state_count = 0x1;
static unsigned const state_count_mask = 0x3;
static unsigned const state_mask_bit = 0x4;
static unsigned const state_add_list = 0x40;
static unsigned const state_lock = 0x80;
static unsigned const state_ver = 0x100;

struct add_item_t
{
add_item_t* volatile next_;
key_t key_;
value_t value_;
};

struct add_cell_t
{
unsigned volatile lock_;
add_item_t* head_;
add_item_t item_ [add_cell_item_count];
};
};

template<typename key_t, typename value_t>
struct concurrent_hash_map_layout<key_t, value_t, 8, true, 8, true>
{
static size_t const initial_size = 256;
static size_t const cell_to_add_cell_ratio = 128;
static size_t const cell_item_count = 3;
static size_t const add_cell_item_count = 10;

static unsigned const state_count = 0x1;
static unsigned const state_count_mask = 0x3;
static unsigned const state_mask_bit = 0x4;
static unsigned const state_add_list = 0x40;
static unsigned const state_lock = 0x80;
static unsigned const state_ver = 0x100;

struct add_item_t
{
add_item_t* volatile next_;
key_t key_;
value_t value_;
};

struct add_cell_t
{
unsigned volatile lock_;
add_item_t* head_;
add_item_t item_ [add_cell_item_count];
char unused_pad_ [8];
};
};

template<typename key_t, typename value_t>
struct concurrent_hash_map_layout<key_t, value_t, 4, true, 8, true>
{
static size_t const initial_size = 256;
static size_t const cell_to_add_cell_ratio = 128;
static size_t const cell_item_count = 4;
static size_t const add_cell_item_count = 7;

static unsigned const state_count = 0x1;
static unsigned const state_count_mask = 0x3;
static unsigned const state_mask_bit = 0x4;
static unsigned const state_add_list = 0x40;
static unsigned const state_lock = 0x80;
static unsigned const state_ver = 0x100;

struct add_item_t
{
add_item_t* volatile next_;
key_t key_;
value_t value_;
};

struct add_cell_t
{
unsigned volatile lock_;
add_item_t* head_;
add_item_t item_ [add_cell_item_count];
char unused_pad_ [8];
};
};

template<typename key_t, typename value_t>
struct concurrent_hash_map_layout<key_t, value_t, 8, true, 4, true>
{
static size_t const initial_size = 256;
static size_t const cell_to_add_cell_ratio = 128;
static size_t const cell_item_count = 4;
static size_t const add_cell_item_count = 7;

static unsigned const state_count = 0x1;
static unsigned const state_count_mask = 0x3;
static unsigned const state_mask_bit = 0x4;
static unsigned const state_add_list = 0x40;
static unsigned const state_lock = 0x80;
static unsigned const state_ver = 0x100;

static size_t const add_item_size = 128;

struct add_item_t
{
key_t key_;
value_t value_;
add_item_t* volatile next_;
};

struct add_cell_t
{
unsigned volatile lock_;
add_item_t* head_;
add_item_t item_ [add_cell_item_count];
char unused_pad_ [8];
};
};

template<typename key_t, typename value_t,
typename traits_t = concurrent_hash_map_traits<key_t, value_t> >
class concurrent_hash_map
{
public:
typedef typename traits_t::key_in_t key_in_t;
typedef typename traits_t::key_out_t key_out_t;
typedef typename traits_t::value_in_t value_in_t;
typedef typename traits_t::value_out_t value_out_t;
typedef size_t hash_t;
typedef concurrent_hash_map_layout<key_t, value_t,
sizeof(key_t), is_pod<key_t>::result,
sizeof(value_t), is_pod<value_t>::result> layout_t;

static size_t const initial_size = layout_t::initial_size;
static size_t const cell_to_add_cell_ratio =
layout_t::cell_to_add_cell_ratio;
static size_t const cell_item_count = layout_t::cell_item_count;
static size_t const add_cell_item_count =
layout_t::add_cell_item_count;

static unsigned const state_count = layout_t::state_count;
static unsigned const state_count_mask =
layout_t::state_count_mask;
static unsigned const state_mask_bit = layout_t::state_mask_bit;
static unsigned const state_add_list = layout_t::state_add_list;
static unsigned const state_lock = layout_t::state_lock;
static unsigned const state_ver = layout_t::state_ver;

typedef typename layout_t::add_item_t add_item_t;
typedef typename layout_t::add_cell_t add_cell_t;

struct cell_t
{
unsigned volatile state_;
add_item_t* volatile head_;
key_t key_ [cell_item_count];
value_t value_ [cell_item_count];
};

struct block_hdr_t : pcx_node
{
hash_t cell_mask_;
size_t cell_count_;
size_t add_cell_count_;
cell_t* cell_;
add_cell_t* add_cell_;
void* mem_;
};

struct handle_t
{
cell_t* cell_;
unsigned state_;
value_t* value_;
};

concurrent_hash_map()
{
anchor_ = create_block(initial_size);
resizing_ = 0;
}

~concurrent_hash_map()
{
block_dtor(anchor_);
}

bool find(key_in_t k, value_out_t v)
{
hash_t const hash = traits_t::hash(k);
retry:
block_hdr_t* b = anchor_; // memory_order_consume
size_t const cell_idx = hash & b->cell_mask_;
cell_t& cell = b->cell_[cell_idx];
unsigned const state = cell.state_; // load_acquire_wrt_loads
unsigned const item_count = state & state_count_mask;
if (item_count > 0)
{
if (traits_t::compare_key(k, cell.key_[0]))
{
if (state & (state_mask_bit << 0))
{
traits_t::copy_value(v, cell.value_[0]);
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return true;
else
goto retry;
}
else
{
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return false;
else
goto retry;
}
}

if (item_count > 1)
{
if (traits_t::compare_key(k, cell.key_[1]))
{
if (state & (state_mask_bit << 1))
{
traits_t::copy_value(v, cell.value_[1]);
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return true;
else
goto retry;
}
else
{
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return false;
else
goto retry;
}
}

if (item_count > 2)
{
if (traits_t::compare_key(k, cell.key_[2]))
{
if (state & (state_mask_bit << 2))
{
traits_t::copy_value(v, cell.value_[2]);
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return true;
else
goto retry;
}
else
{
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return false;
else
goto retry;
}
}

add_item_t* aitem = cell.head_;
while (aitem)
{
if (traits_t::compare_key(k, aitem->key_))
{
traits_t::copy_value(v, aitem->value_);
// load_release_wrt_loads
unsigned const state2 = cell.state_;
if (state == state2)
return true;
else
goto retry;
}
aitem = aitem->next_;
}
}
}
}
unsigned const state2 = cell.state_; // load_release_wrt_loads
if (state == state2)
return false;
else
goto retry;
}

bool find_and_lock(key_in_t k, value_out_t v, handle_t& h)
{
size_t const hash = traits_t::hash(k);
block_hdr_t* b;
unsigned state;
cell_t& cell = lock_cell(hash, b, state);
unsigned const item_count = state & state_count_mask;
for (unsigned i = 0; i != item_count; ++i)
{
if (traits_t::compare_key(k, cell.key_[i]))
{
traits_t::copy_value(v, cell.value_[i]);
h.cell_ = &cell;
h.state_ = state;
h.value_ = &cell.value_[i];
return true;
}
}
add_item_t* aitem = cell.head_;
while (aitem)
{
if (traits_t::compare_key(k, aitem->key_))
{
traits_t::copy_value(v, aitem->value_);
h.cell_ = &cell;
h.state_ = state;
h.value_ = &aitem->value_;
return true;
}
aitem = aitem->next_;
}
unsigned const state2 = state & ~state_lock;
cell.state_ = state2; // memory_order_release
return false;
}

void update_and_unlock(value_in_t v, handle_t const& h)
{
traits_t::atomic_copy_value(*h.value_, v);
unsigned const state2 = (h.state_ & ~state_lock) + state_ver;
h.cell_->state_ = state2; // memory_order_release
}

void unlock(handle_t const& h)
{
unsigned const state2 = h.state_ & ~state_lock;
h.cell_->state_ = state2;
}

bool insert(key_in_t k, value_in_t v)
{
size_t const hash = traits_t::hash(k);
retry:
block_hdr_t* b;
unsigned state;
cell_t& cell = lock_cell(hash, b, state);
unsigned const item_count = state & state_count_mask;
for (unsigned i = 0; i != item_count; ++i)
{
if (traits_t::compare_key(k, cell.key_[i]))
{
unsigned const state2 = state & ~state_lock;
cell.state_ = state2; // memory_order_release
assert(0 == (cell.state_ & state_lock));
return false;
}
}
add_item_t* aitem = cell.head_;
while (aitem)
{
if (traits_t::compare_key(k, aitem->key_))
{
unsigned const state2 = state & ~state_lock;
cell.state_ = state2; // memory_order_release
assert(0 == (cell.state_ & state_lock));
return false;
}
aitem = aitem->next_;
}
if (item_count < cell_item_count)
{
traits_t::copy_key(cell.key_[item_count], k);
traits_t::copy_value(cell.value_[item_count], v);
unsigned const state2 = ((state & ~state_lock)
| (state_mask_bit << item_count))
+ state_count + state_ver;
cell.state_ = state2; // memory_order_release
assert(0 == (cell.state_ & state_lock));
return true;
}
else
{
add_item_t* aitem = alloc_add_item(b, hash);
if (0 == aitem)
{
grow(cell, state);
goto retry;
}
traits_t::copy_key(aitem->key_, k);
traits_t::copy_value(aitem->value_, v);
aitem->next_ = cell.head_;
cell.head_ = aitem; // memory_order_release
unsigned const state2 = ((state & ~state_lock)
| state_add_list) + state_ver;
cell.state_ = state2; // memory_order_release
}
assert(0 == (cell.state_ & state_lock));
return true;
}

bool insert_or_update(key_in_t k, value_in_t v)
{
size_t const hash = traits_t::hash(k);
retry:
block_hdr_t* b;
unsigned state;
cell_t& cell = lock_cell(hash, b, state);
unsigned const item_count = state & state_count_mask;
for (unsigned i = 0; i != item_count; ++i)
{
if (traits_t::compare_key(k, cell.key_[i]))
{
traits_t::atomic_copy_value(cell.value_[i], v);
unsigned const state2 =
(state & ~state_lock) + state_ver;
cell.state_ = state2; // memory_order_release
return false;
}
}
add_item_t* aitem = cell.head_;
while (aitem)
{
if (traits_t::compare_key(k, aitem->key_))
{
traits_t::atomic_copy_value(aitem->value_, v);
unsigned const state2 =
(state & ~state_lock) + state_ver;
cell.state_ = state2; // memory_order_release
return false;
}
aitem = aitem->next_;
}
if (item_count < cell_item_count)
{
traits_t::copy_key(cell.key_[item_count], k);
traits_t::copy_value(cell.value_[item_count], v);
unsigned const state2 = ((state & ~state_lock)
| (state_mask_bit << item_count))
+ state_count + state_ver;
cell.state_ = state2; // memory_order_release
return true;
}
else
{
add_item_t* aitem = alloc_add_item(b, hash);
if (0 == aitem)
{
grow(cell, state);
goto retry;
}
traits_t::copy_key(aitem->key_, k);
traits_t::copy_value(aitem->value_, v);
aitem->next_ = cell.head_;
cell.head_ = aitem; // memory_order_release
if (aitem->next_)
{
unsigned const state2 =
state & ~state_lock) + state_ver;
cell.state_ = state2; // memory_order_release
}
else
{
unsigned const state2 =
((state & ~state_lock) | state_add_list)
+ state_ver;
cell.state_ = state2; // memory_order_release
}
}
return true;
}

bool remove(key_in_t k)
{
size_t const hash = traits_t::hash(k);
block_hdr_t* b;
unsigned state;
cell_t& cell = lock_cell(hash, b, state);
unsigned const item_count = state & state_count_mask;
for (unsigned i = 0; i != item_count; ++i)
{
if (traits_t::compare_key(k, cell.key_[i]))
{
unsigned const state2 =
(state & ~(state_mask_bit << i)) + state_ver;
cell.state_ = state2; // memory_order_release
if (cell.head_)
{
add_item_t* aitem = cell.head_;
traits_t::copy_key(cell.key_[i], aitem->key_);
traits_t::copy_value
(cell.value_[i], aitem->value_);
unsigned const state3 = state + 2 * state_ver;
cell.state_ = state3; // memory_order_release
add_item_t* aitem_next = aitem->next_;
cell.head_ = aitem_next;
if (aitem_next)
{
unsigned const state4 =
(state & ~state_lock) + 3 * state_ver;
cell.state_ = state4; // memory_order_release
}
else
{
unsigned const state4 =
((state & ~state_lock) & ~state_add_list)
+ 3 * state_ver;
cell.state_ = state4; // memory_order_release
}
free_add_item(aitem);
}
else if (i != item_count - 1)
{
traits_t::copy_key(cell.key_[i],
cell.key_[item_count - 1]);
traits_t::copy_value(cell.value_[i],
cell.value_[item_count - 1]);
unsigned state3 = (state & ~state_lock)
- state_count + state_ver;
cell.state_ = state3; // memory_order_release
}
else
{
unsigned state3 =
(state & ~state_lock)
- state_count + state_ver;
cell.state_ = state3; // memory_order_release
}
assert(0 == (cell.state_ & state_lock));
return true;
}
}
add_item_t* volatile* aitem_prev = &cell.head_;
add_item_t* aitem = *aitem_prev;
while (aitem)
{
if (traits_t::compare_key(k, aitem->key_))
{
add_item_t* aitem_next = aitem->next_;
*aitem_prev = aitem_next;
if (cell.head_)
{
unsigned state2 =
(state & ~state_lock) + state_ver;
cell.state_ = state2; // memory_order_release
}
else
{
unsigned state2 = ((state & ~state_lock)
& ~state_add_list) + state_ver;
cell.state_ = state2; // memory_order_release
}
free_add_item(aitem);
assert(0 == (cell.state_ & state_lock));
return true;
}
aitem_prev = &aitem->next_;
aitem = *aitem_prev;
}
unsigned const state2 = state & ~state_lock;
cell.state_ = state2; // memory_order_release
assert(0 == (cell.state_ & state_lock));
return false;
}

private:
block_hdr_t* volatile anchor_;
unsigned volatile resizing_;

block_hdr_t* create_block(size_t cell_count)
{
size_t add_cell_count =
cell_count / cell_to_add_cell_ratio;
size_t size =
sizeof(block_hdr_t) + sizeof(cell_t) * cell_count
+ sizeof(add_cell_t) * add_cell_count;
void* mem = _aligned_offset_malloc
(size, cacheline_size, sizeof(block_hdr_t));
block_hdr_t* b = new (mem) block_hdr_t;
b->cell_mask_ = cell_count - 1;
b->cell_count_ = cell_count;
b->add_cell_count_ = add_cell_count;
b->cell_ = (cell_t*)((char*)b + sizeof(block_hdr_t));
b->add_cell_ = (add_cell_t*)((char*)b + sizeof(block_hdr_t)
+ sizeof(cell_t) * cell_count);
b->mem_ = mem;
cell_t* cell = (cell_t*)((char*)mem + sizeof(block_hdr_t));
// for now support only for pod types
// if block is very big than use non-temporal stores
memset(cell, 0, size - sizeof(block_hdr_t));
return b;
}

static void block_dtor(pcx_node* n)
{
_aligned_free(static_cast<block_hdr_t*>(n)->mem_);
}

add_item_t* alloc_add_item(block_hdr_t* b, size_t hash)
{
size_t const add_cell_count = b->add_cell_count_;
for (size_t iter = 0; iter != 2; ++iter)
{
for (size_t idx = 0; idx != add_cell_count; ++idx)
{
size_t const add_cell_idx =
(hash + idx) % add_cell_count;
add_cell_t& add_cell = b->add_cell_[add_cell_idx];
for (;;)
{
if (state_lock != _InterlockedExchange
((long*)&add_cell.lock_, state_lock))
break;
SwitchToThread();
}
if (add_cell.head_)
{
add_item_t* aitem = add_cell.head_;
add_cell.head_ = aitem->next_;
add_cell.lock_ = 0;
return aitem;
}
add_cell.lock_ = 0;
}
}
return 0;
}

void free_add_item(add_item_t* item)
{
add_cell_t& add_cell =
*(add_cell_t*)((intptr_t)item
- (intptr_t)item % sizeof(add_cell_t));
for (;;)
{
if (state_lock != _InterlockedExchange
((long*)add_cell.lock_, state_lock))
break;
SwitchToThread();
}
item->next_ = add_cell.head_;
add_cell.head_ = item;
add_cell.lock_ = 0;
}

__forceinline
cell_t& lock_cell
(hash_t hash, block_hdr_t*& block, unsigned& state)
{
for (;;)
{
block_hdr_t* b = anchor_; // memory_order_consume
size_t const cell_idx = hash & b->cell_mask_;
cell_t& cell = b->cell_[cell_idx];
unsigned const st = cell.state_;
if (st & state_lock)
{
SwitchToThread();
continue;
}
if (st == (unsigned)_InterlockedCompareExchange
((long*)&cell.state_, st | state_lock, st))
{
block = b;
state = st | state_lock;
return cell;
}
}
}

void grow(cell_t& cell, unsigned state)
{
unsigned const already_resizing =
(unsigned)_InterlockedExchange
(long*)&resizing_, state_lock);
unsigned const state2 = state & ~state_lock;
cell.state_ = state2; // memory_order_release
if (already_resizing)
{
while (resizing_)
{
SwitchToThread();
}
}
else
{
block_hdr_t* b = anchor_;
size_t const cell_count = b->cell_count_;
block_hdr_t* new_block = create_block(cell_count * 4);
for (size_t i = 0; i != cell_count; ++i)
{
block_hdr_t* bb;
unsigned state;
lock_cell(i, bb, state);
}
for (size_t cell_idx = 0; cell_idx != cell_count;
++cell_idx)
{
cell_t& cell = b->cell_[cell_idx];
unsigned const state = cell.state_;
unsigned const item_count = state & state_count_mask;
for (unsigned i = 0; i != item_count; ++i)
{
hash_t hash = traits_t::hash(cell.key_[i]);
cell_t& new_cell = new_block->cell_
[hash & new_block->cell_mask_];
unsigned new_cell_count =
new_cell.state_ & state_count_mask;
traits_t::copy_key(
new_cell.key_[new_cell_count],
cell.key_[i]);
traits_t::copy_value(
new_cell.value_[new_cell_count],
cell.value_[i]);
new_cell.state_ = new_cell.state_ + state_count
+ (state_mask_bit << new_cell_count);
}
add_item_t* aitem = cell.head_;
while (aitem)
{
hash_t hash = traits_t::hash(aitem->key_);
cell_t& new_cell = new_block->cell_
[hash & new_block->cell_mask_];
unsigned new_cell_count =
new_cell.state_ & state_count_mask;
if (new_cell_count < cell_item_count)
{
traits_t::copy_key(new_cell.key_
[new_cell_count], aitem->key_);
traits_t::copy_value(new_cell.value_
[new_cell_count], aitem->value_);
new_cell.state_ = new_cell.state_
+ state_count
+ (state_mask_bit << new_cell_count);
}
else
{
add_item_t* new_aitem =
alloc_add_item(new_block, hash);
assert(new_aitem);
traits_t::copy_key
(new_aitem->key_, aitem->key_);
traits_t::copy_value
(new_aitem->value_, aitem->value_);
new_aitem->next_ = new_cell.head_;
new_cell.head_ = new_aitem;
new_cell.state_ =
new_cell.state_ | state_add_list;
}
aitem = aitem->next_;
}
}
anchor_ = new_block; // memory_order_release
resizing_ = 0;
pcx_thread& th = pcx_thread::get();
th.defer(b, &concurrent_hash_map::block_dtor);
th.promote();
}
}

concurrent_hash_map(concurrent_hash_map const&);
concurrent_hash_map& operator = (concurrent_hash_map const&);
};

}

/************************* pcx.h *************************/

#pragma once

#include <intrin.h>
#pragma intrinsic (_InterlockedExchangeAdd)
#pragma intrinsic (_InterlockedCompareExchange)

namespace rl
{

size_t const cacheline_size = 64;

struct pcx_node
{
typedef void (*pcx_dtor_t)(pcx_node*);
pcx_node* pcx_next_;
pcx_dtor_t pcx_dtor_;
};

namespace pcx_int
{
unsigned const word_bits = 32;
unsigned const collector_bits = 4;
unsigned const collector_count = 1 << collector_bits;
unsigned const counter_inc = 1 << (collector_bits * 2);
unsigned const is_current_inc = 1;
unsigned const back_link_inc = 2;

struct master;
struct collector;

struct local_collector
{
pcx_node* defer_head_;
pcx_node defer_tail_;
unsigned defer_size_;
};

struct thread_int
{
pcx_int::master* master_;
pcx_int::collector* collectors_;
unsigned recursion_count_;
unsigned is_acquired_;
unsigned collector_index_;
unsigned last_seen_collector_index_;
unsigned flush_tail_;
pcx_node* defer_head_;
pcx_node defer_tail_;
unsigned defer_size_;
unsigned promote_;
local_collector local_collectors_ [collector_count];
};
}

class pcx_thread : private pcx_int::thread_int
{
public:
static pcx_thread& get();

void acquire();
void release();
void defer(pcx_node* node, pcx_node::pcx_dtor_t dtor);
void flush();
void promote();
void quiescent();

private:
void init();
void deinit();
unsigned acquire_impl();
void release_impl(unsigned, unsigned);
void flush_impl();
void local_flush();
void quiescent_impl();
friend void init();
friend void deinit();
friend void thread_callback(bool);
};

namespace pcx_int
{
struct master
{
char pad0_ [64];

unsigned garbage_threshold_;

char pad1_ [64];

struct state_part
{
unsigned current_collector_ : collector_bits;
unsigned collector_tail_ : collector_bits;
unsigned outer_counter_ :
word_bits - 2 * collector_bits;
};

union state
{
long whole_;
state_part part_;
};

state state_;

char pad2_ [64];

state state_copy_;

char pad3_ [64];
};

struct collector
{
char pad0_ [64];

pcx_node* defer_list_head_;
unsigned defer_list_size_;

char pad1_ [64];

struct state_part
{
unsigned is_current_ : 1;
unsigned back_link_ : 1;
unsigned pad_ : collector_bits * 2 - 2;
unsigned inner_counter_ :
word_bits - 2 * collector_bits;
};

union state
{
long whole_;
state_part part_;
};

state state_;

char pad2_ [64];
};

__declspec(selectany)
master g_master;
__declspec(selectany)
collector g_collectors [collector_count];
__declspec(selectany, thread)
thread_int g_thread_instance;

typedef void (__stdcall nt_tls_cb_t)(void*, unsigned long, void*);
nt_tls_cb_t on_tls_callback;

#pragma data_seg(push, old_seg)
#pragma data_seg(".CRT$XLB")
__declspec(selectany, dllexport)
nt_tls_cb_t* volatile p_thread_callback = on_tls_callback;
#pragma data_seg(pop, old_seg)

inline void __stdcall on_tls_callback
(void*, unsigned long reason, void*)
{
if (1 == reason)
{
init();
thread_callback(true);
}
else if (0 == reason)
{
thread_callback(false);
deinit();
}
if (2 == reason)
{
thread_callback(true);
}
else if (3 == reason)
{
thread_callback(false);
}
}
}

inline void init()
{
using namespace pcx_int;
master& m = g_master;
m.garbage_threshold_ = 128;
m.state_.part_.current_collector_ = 0;
m.state_.part_.collector_tail_ = 0;
m.state_.part_.outer_counter_ = 0;
m.state_copy_.part_.current_collector_ = 0;
m.state_copy_.part_.collector_tail_ = 0;
m.state_copy_.part_.outer_counter_ = 0;
for (unsigned i = 0; i != collector_count; ++i)
{
collector& c = g_collectors[i];
c.defer_list_head_ = 0;
c.defer_list_size_ = 0;
c.state_.part_.is_current_ = 1;
c.state_.part_.back_link_ = 1;
c.state_.part_.inner_counter_ = 0;
}
g_collectors[0].state_.part_.back_link_ = 0;
}

inline void deinit()
{
using namespace pcx_int;
pcx_thread::get().release_impl(
g_master.state_.part_.current_collector_, is_current_inc);
}

inline void thread_callback(bool init)
{
if (init)
pcx_thread::get().init();
else
pcx_thread::get().deinit();
}

inline pcx_thread& pcx_thread::get()
{
return static_cast<pcx_thread&>(pcx_int::g_thread_instance);
}

inline unsigned pcx_thread::acquire_impl()
{
using namespace pcx_int;
long const prev =
_InterlockedExchangeAdd(
&master_->state_.whole_, counter_inc);
master::state_part u = {prev};
if (u.current_collector_ == flush_tail_
&& local_collectors_[flush_tail_].defer_size_)
{
local_flush();
}

return u.current_collector_;
}

inline void pcx_thread::release_impl(unsigned index, unsigned count)
{
using namespace pcx_int;
collector& c = collectors_[index];
unsigned const prev =
_InterlockedExchangeAdd(
&c.state_.whole_, (unsigned)-(int)count);
if (0 == prev - count)
{
pcx_node* curr = c.defer_list_head_;
while (curr)
{
pcx_node* next = curr->pcx_next_;
curr->pcx_dtor_(curr);
curr = next;
}
c.defer_list_head_ = 0;
c.defer_list_size_ = 0;
c.state_.part_.back_link_ = 1;
c.state_.part_.is_current_ = 1;

long u;
if (index != collector_count - 1)
u = collector_count;
else
u = -(long)(collector_count * (collector_count - 1));
_InterlockedExchangeAdd(&master_->state_.whole_, u);

release_impl((index + 1) % collector_count, back_link_inc);
}
}

inline void pcx_thread::flush_impl()
{
using namespace pcx_int;
_mm_mfence();
master::state state = master_->state_;
last_seen_collector_index_ = state.part_.current_collector_;
collector& gc = collectors_[state.part_.current_collector_];
local_collector& lc = local_collectors_
[state.part_.current_collector_];
lc.defer_head_->pcx_next_ = defer_tail_.pcx_next_;
lc.defer_head_ = defer_tail_.pcx_next_;
lc.defer_size_ += defer_size_;
defer_head_ = &defer_tail_;
defer_tail_.pcx_next_ = 0;
defer_size_ = 0;
if (master_->garbage_threshold_ < lc.defer_size_ || promote_)
{
master::state cmp;
master::state val;
do
{
cmp = master_->state_;
if (cmp.part_.current_collector_ !=
last_seen_collector_index_)
{
promote_ = 0;
return;
}
unsigned next_index =
(last_seen_collector_index_ + 1) % collector_count;
if (cmp.part_.collector_tail_ == next_index)
return;
val = cmp;
val.part_.current_collector_ += 1;
val.part_.outer_counter_ = 0;
}
while (cmp.whole_ != _InterlockedCompareExchange(
(long*)&master_->state_.whole_, val.whole_, cmp.whole_));
last_seen_collector_index_ = val.part_.current_collector_;
promote_ = 0;
_InterlockedIncrement((long*)&master_->state_copy_.whole_);
_InterlockedExchangeAdd((long*)&gc.state_.whole_,
cmp.part_.outer_counter_ * counter_inc - is_current_inc);
}
}

__declspec(noinline)
inline void pcx_thread::local_flush()
{
using namespace pcx_int;
if (flush_tail_ == master_->state_.part_.collector_tail_)
return;
local_collector& lc = local_collectors_[flush_tail_];
pcx_node* curr = lc.defer_tail_.pcx_next_;
while (curr)
{
pcx_node* next = curr->pcx_next_;
curr->pcx_dtor_(curr);
curr = next;
}
lc.defer_head_ = &lc.defer_tail_;
lc.defer_tail_.pcx_next_ = 0;
lc.defer_size_ = 0;
flush_tail_ = (flush_tail_ + 1) % collector_count;
}

__declspec(noinline)
inline void pcx_thread::quiescent_impl()
{
using namespace pcx_int;
if (defer_size_)
flush_impl();
release_impl(collector_index_, counter_inc);
collector_index_ = acquire_impl();
}

inline void pcx_thread::acquire()
{
using namespace pcx_int;
recursion_count_ += 1;
if (1 != recursion_count_)
return;
if (is_acquired_)
return;
collector_index_ = acquire_impl();
last_seen_collector_index_ = collector_index_;
is_acquired_ = 1;
}

inline void pcx_thread::release()
{
using namespace pcx_int;
recursion_count_ -= 1;
if (0 == recursion_count_)
{
if (master_->state_copy_.part_.current_collector_ !=
collector_index_
|| promote_)
{
if (defer_size_)
flush_impl();
release_impl(collector_index_, counter_inc);
is_acquired_ = 0;
}
}
if (flush_tail_ != last_seen_collector_index_)
{
local_flush();
}
}

inline void pcx_thread::quiescent()
{
if (master_->state_copy_.part_.current_collector_ !=
collector_index_
|| promote_)
{
quiescent_impl();
}
if (flush_tail_ != last_seen_collector_index_)
{
local_flush();
}
}

inline void pcx_thread::defer
(pcx_node* node, pcx_node::pcx_dtor_t dtor)
{
using namespace pcx_int;
node->pcx_next_ = 0;
node->pcx_dtor_ = dtor;
defer_head_->pcx_next_ = node;
defer_head_ = node;
defer_size_ += 1;
}

inline void pcx_thread::flush()
{
using namespace pcx_int;
if (recursion_count_)
return;
if (0 == is_acquired_)
return;
if (defer_size_)
flush_impl();
release_impl(collector_index_, counter_inc);
is_acquired_ = 0;
}

inline void pcx_thread::promote()
{
promote_ = 1;
}

inline void pcx_thread::init()
{
using namespace pcx_int;
master_ = &g_master;
collectors_ = g_collectors;
defer_head_ = &defer_tail_;
defer_tail_.pcx_next_ = 0;
for (unsigned i = 0; i != collector_count; ++i)
{
local_collectors_[i].defer_head_ =
&local_collectors_[i].defer_tail_;
}
}

inline void pcx_thread::deinit()
{
flush();
}

}

/********************** hash_map.cpp **********************/

#include "pcx.h"
#include "hash_map.h"

int main()
{
rl::concurrent_hash_map<int, int> m;

rl::pcx_thread& th = rl::pcx_thread::get();
th.acquire();

// insert
for (int i = 0; i != 3; ++i)
{
bool inserted = m.insert(i, i + 1);
th.quiescent();
}

// find
for (int i = 0; i != 3; ++i)
{
int value;
bool found = m.find(i, value);
if (found)
{
// use value
}
th.quiescent();
}

// update or insert
for (int i = 0; i != 3; ++i)
{
bool inserted = m.insert_or_update(i, i + 2);
th.quiescent();
}

// update inplace
for (int i = 0; i != 3; ++i)
{
int value;
rl::concurrent_hash_map<int, int>::handle_t h;
bool found = m.find_and_lock(i, value, h);
if (found)
{
if (rand() % 2)
m.update_and_unlock(value + 100, h);
else
m.unlock(h);
}
th.quiescent();
}

th.release();

Message has been deleted

climb...@gmail.com

unread,

Sep 13, 2008, 8:47:36 PM9/13/08

to

On Sep 12, 3:47 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>
> Here are benchmark results on Q6600.
> Element count = 256, readonly workload, 8 threads.
> TBB concurrent_hash_map:
> Single core: 338 cycles/op
> Cores 0&1: 436 cycles/op (scaling 1.55)
> Cores 0&2: 1080 cycles/op (scaling 0.63)
> All cores: 1412 cycles/op (scaling 0.96)
>

Hi, I am also trying to test some multithreaded programs, but I don't
know what tool(s) to use to get the cpu cycle measurement. I really
would like to know how you obtain the benmark result.

tony

Dmitriy V'jukov

unread,

Sep 15, 2008, 7:16:01 AM9/15/08

to

On Sep 14, 4:47 am, climber....@gmail.com wrote:

> > Here are benchmark results on Q6600.
> > Element count = 256, readonly workload, 8 threads.
> > TBB concurrent_hash_map:
> > Single core: 338 cycles/op
> > Cores 0&1: 436 cycles/op (scaling 1.55)
> > Cores 0&2: 1080 cycles/op (scaling 0.63)
> > All cores: 1412 cycles/op (scaling 0.96)
>
> Hi, I am also trying to test some multithreaded programs, but I don't
> know what tool(s) to use to get the cpu cycle measurement. I really
> would like to know how you obtain the benmark result.

I run simple benchmark for around 10 seconds. Every thread count
number of executed operations in thread local counter. After 10
seconds I stop all threads, and sum all thread local counters. And
then divide execution time measured in cycles by total number of
operations. Execution time I measure with rdtsc instruction.
Usually results very stable, i.e. deviation is no more than few
cycles.

Alexander Chemeris

unread,

Oct 1, 2008, 7:35:57 AM10/1/08

to

On Sep 15, 3:16 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Sep 14, 4:47 am, climber....@gmail.com wrote:
>
> > > Here are benchmark results on Q6600.
> > > Element count = 256, readonly workload, 8 threads.
> > > TBB concurrent_hash_map:
> > > Single core: 338 cycles/op
> > > Cores 0&1: 436 cycles/op (scaling 1.55)
> > > Cores 0&2: 1080 cycles/op (scaling 0.63)
> > > All cores: 1412 cycles/op (scaling 0.96)
>
> > Hi, I am also trying to test some multithreaded programs, but I don't
> > know what tool(s) to use to get the cpu cycle measurement. I really
> > would like to know how you obtain the benmark result.
>
> I run simple benchmark for around 10 seconds. Every thread count
> number of executed operations in thread local counter. After 10
> seconds I stop all threads, and sum all thread local counters. And
> then divide execution time measured in cycles by total number of
> operations. Execution time I measure with rdtsc instruction.
> Usually results very stable, i.e. deviation is no more than few
> cycles.

How do you get rid of the task switches and the time, spent in
other threads? Some people just assume that if task switch
occurred, then time, spent inside a calculation cycle deviate
a lot from the mean and just through away this timing.
Do you use the same technique?

Alexander Chemeris.

Dmitriy V'jukov

unread,

Oct 1, 2008, 7:58:51 AM10/1/08

to

On Oct 1, 3:35 pm, Alexander Chemeris <alexander.cheme...@gmail.com>
wrote:

> > I run simple benchmark for around 10 seconds. Every thread count
> > number of executed operations in thread local counter. After 10
> > seconds I stop all threads, and sum all thread local counters. And
> > then divide execution time measured in cycles by total number of
> > operations. Execution time I measure with rdtsc instruction.
> > Usually results very stable, i.e. deviation is no more than few
> > cycles.
>
> How do you get rid of the task switches and the time, spent in
> other threads? Some people just assume that if task switch
> occurred, then time, spent inside a calculation cycle deviate
> a lot from the mean and just through away this timing.
> Do you use the same technique?

I don't get rid of context switches and exterior load. Since I measure
around 10^9 operations, I don't think that it will have some effect.
Machine is maximally relieved of any exterior load while tests are
conducted. I think that bigger mistake will be introduced by measuring
time of every single operation.

Dmitriy V'jukov

Alexander Chemeris

unread,

Oct 2, 2008, 6:10:18 AM10/2/08

to

I think using rdtsc won't introduce too much distortion, even when
timing small pieces of code. You can look into ffmpeg's timing
facilities.
Ffmpeg guys really do care about performance and have developed
a good set of C macros for performance measurement.

But, with current long-pipline CPUs there is a problem that you can't
actually say that "this operation takes exactly N cycles". You have to
clarify is it a number of cycles from the start of the first
instruction
to the start of the last instruction, or from the start of the first
instruction
to the end of the last instruction. If I recall correctly, using rdtsc
for
measuring every operation will give you the latter. While measuring
the overall time will give you the former + loop overhead + context
switches overhead. Probably context switches overhead is negligible
because of the nature of experiment, but you can't get rid of loop
overhead (though it should be just few cycles).

So, in making story short - it would be interesting if you make
timings on per-operation basis and compare that results with your
previous results.

--
Regards,
Alexander Chemeris.

SIPez LLC.
SIP VoIP, IM and Presence Consulting
http://www.SIPez.com
tel: +1 (617) 273-4000

Dmitriy V'jukov

unread,

Oct 2, 2008, 6:32:41 AM10/2/08

to

On Oct 2, 2:10 pm, Alexander Chemeris <alexander.cheme...@gmail.com>
wrote:

> I think using rdtsc won't introduce too much distortion, even when

> timing small pieces of code. You can look into ffmpeg's timing
> facilities.
> Ffmpeg guys really do care about performance and have developed
> a good set of C macros for performance measurement.
>
> But, with current long-pipline CPUs there is a problem that you can't
> actually say that "this operation takes exactly N cycles". You have to
> clarify is it a number of cycles from the start of the first
> instruction
> to the start of the last instruction, or from the start of the first
> instruction
> to the end of the last instruction. If I recall correctly, using rdtsc
> for
> measuring every operation will give you the latter. While measuring
> the overall time will give you the former + loop overhead + context
> switches overhead. Probably context switches overhead is negligible
> because of the nature of experiment, but you can't get rid of loop
> overhead (though it should be just few cycles).
>
> So, in making story short - it would be interesting if you make
> timings on per-operation basis and compare that results with your
> previous results.

I agree that in the perfect world I have to make several extremely
precise measurements. But in real world I really don't have time to
make such things (I'm not a researcher, and nobody pays me for that).
Nevertheless I believe my measurements give precise results (but maybe
not very-very precise results). And I post source code, so everybody
is free to make any measurements he wants ;)
If we will take into consideration loop overhead and context switches
etc, well, maybe we will get 25 cycles/op instead of 30 cycles/op. I
think this is just inessential. This won't affect linear scaling of my
map, and super linear degradation of lock-based map.
As for rdtsc. When we are talking about things like 30 cycles, I think
rdtsc can introduce considerable distortion. Btw, when I was measuring
exact timings of some things I was using following scheme:

CPUID [eax=0]
t1 = RDTSC
CPUID [eax=0]
[tested code]
CPUID [eax=0]
t2 = RDTSC
CPUID [eax=0]
t = t2 - t1 - K

K must be determined experimentally so that t = 0 when [tested code]
is empty.
This allows one to measure timing down to individual instructions.

Dmitriy V'jukov

demo

unread,

Oct 17, 2008, 9:09:46 AM10/17/08

to

On Sep 12, 9:51 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Sep 12, 11:47 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
>

This crashes for me in main at the first th.acquire();

Looks like the master_ in pcx_thread is null. Should this be
initialized somehow?

--
Tore Halvorsen

Chris M. Thomasson

unread,

Oct 17, 2008, 3:45:07 PM10/17/08

to

"demo" <Tore.Ha...@gmail.com> wrote in message
news:ce1bcc30-6115-42c8...@m74g2000hsh.googlegroups.com...

Yes, it should be initialized. AFAICT, he is using a TLS hack in order to
get automatically notified of thread creating/destruction:
___________________________________________________________

typedef void (__stdcall nt_tls_cb_t)(void*, unsigned long, void*);
nt_tls_cb_t on_tls_callback;

#pragma data_seg(push, old_seg)
#pragma data_seg(".CRT$XLB")
__declspec(selectany, dllexport)
nt_tls_cb_t* volatile p_thread_callback = on_tls_callback;
#pragma data_seg(pop, old_seg)

___________________________________________________________

http://www.nynaeve.net/?tag=crt
(scroll to bottom of page for more information...)

This simply does not work on some compilers. For instance, I can't seem get
my version of the hack to work on MSVC 6 through 8. However, I think you can
just manually call this; try something like:

int main() {
rl::pcx_int::on_tls_callback(NULL, 1, NULL);
// [...];
rl::pcx_int::on_tls_callback(NULL, 0, NULL);
return 0;
}

That might do it... Give it a whirl.

;^D

Chris M. Thomasson

unread,

Oct 17, 2008, 3:56:11 PM10/17/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:2Q5Kk.186$TX6...@newsfe06.iad...

>
> "demo" <Tore.Ha...@gmail.com> wrote in message
> news:ce1bcc30-6115-42c8...@m74g2000hsh.googlegroups.com...
> On Sep 12, 9:51 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>> > On Sep 12, 11:47 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>> >
>> >
>
>
>> This crashes for me in main at the first th.acquire();
>
>> Looks like the master_ in pcx_thread is null. Should this be
>> initialized somehow?
>
> Yes, it should be initialized. AFAICT, he is using a TLS hack in order to
> get automatically notified of thread creating/destruction:
> ___________________________________________________________
> typedef void (__stdcall nt_tls_cb_t)(void*, unsigned long, void*);
> nt_tls_cb_t on_tls_callback;
>
> #pragma data_seg(push, old_seg)
> #pragma data_seg(".CRT$XLB")
> __declspec(selectany, dllexport)
> nt_tls_cb_t* volatile p_thread_callback = on_tls_callback;
> #pragma data_seg(pop, old_seg)
> ___________________________________________________________
>
>
> http://www.nynaeve.net/?tag=crt
> (scroll to bottom of page for more information...)
>
>
> This simply does not work on some compilers. For instance, I can't seem
> get my version of the hack to work on MSVC 6 through 8.

I just tried his version, and is does work on MSVC 2005 Express. However,
with wrt the code posted here there are some typos in the code as-is. I will
post corrections inline in a response to the post where the code resides.
Also, he might of fixed them here:

http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/60494
(you can download the hash-map from this post...)

Chris M. Thomasson

unread,

Oct 17, 2008, 4:03:28 PM10/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:7388da2a-c2f6-4979...@p25g2000hsf.googlegroups.com...

> On Sep 12, 11:47 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I will post implementation in following post.

> Implementation is NOT PRODUCTION READY, it's just an illustration of
> the algorithm. Implementation can be compiled only with MSVC/Win32,
> but algorithm is fairly portable, only single-word atomic RMW
> operations are required, and very little of OS specific stuff.

There are two typos in here...

/*********************** hash_map.h ***********************/

[...]

> bool insert_or_update(key_in_t k, value_in_t v)
{

[...]

> if (aitem->next_)
> {
> unsigned const state2 =
> state & ~state_lock) + state_ver;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Missing left-hand parenthesis.

>
> cell.state_ = state2; // memory_order_release
> }
> else
> {
> unsigned const state2 =
> ((state & ~state_lock) | state_add_list)
> + state_ver;
> cell.state_ = state2; // memory_order_release
> }
> }
> return true;
> }

> void grow(cell_t& cell, unsigned state)

> {
> unsigned const already_resizing =
> (unsigned)_InterlockedExchange
> (long*)&resizing_, state_lock);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Another missing left-hand parenthesis.

> unsigned const state2 = state & ~state_lock;
> cell.state_ = state2; // memory_order_release

I see you fixed these in the version you posted over on the Intel forums.

Dmitriy V'jukov

unread,

Oct 17, 2008, 6:46:46 PM10/17/08

to

On 17 окт, 17:09, demo <Tore.Halvor...@gmail.com> wrote:

> This crashes for me in main at the first th.acquire();
>
> Looks like the master_ in pcx_thread is null. Should this be
> initialized somehow?

Sorry for such stupid things!

Manual initialization as proposed by Chris must fix the problem.
The code must NOT be considered as production-ready in any case. I was
not able to port/test it on various compilers/platforms. Actually I
was able to spend on it only around a day.

But I hope you will be able to make it work.

Dmitriy V'jukov

unread,

Oct 17, 2008, 7:04:39 PM10/17/08

to

On 18 окт, 00:03, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

Actually I first post it to Intel forum, and then here few days after.
It seems that some symbols disappear when I was preparing the code for
posting.

Thank you for advice wrt PCX initialization.

Btw, do you run the code? Don't you notice any crashes? I suspect some
subtle error in this version of proxy-collector. Although I am not
sure. I don't have time to kick $hit from it with Relacy.

Incidentally, I am thinking about creating a set of standard Relacy
tests for some standard data-structures, like PDR, producer-consumer
queue etc.
I.e. for example set of tests for PDR which is distributed with
Relacy. You just parametrize the set with your implementation of PDR
and run the set. No need to invent and implement your own tests for
PDR. Something like:
#include "my_pdr_implementation.h"
int main() {
rl::standard_test_set_for_pdr<my_pdr_implementation>::run();
}
Unfortunately I don't have time to implement it right now.

Dmitriy V'jukov

Chris M. Thomasson

unread,

Oct 17, 2008, 7:59:48 PM10/17/08

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:77e155ae-eed2-4ebc...@v15g2000hsa.googlegroups.com...

On 18 окт, 00:03, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

[...]

> > I see you fixed these in the version you posted over on the Intel
> > forums.

> Actually I first post it to Intel forum, and then here few days after.
> It seems that some symbols disappear when I was preparing the code for
> posting.

> Thank you for advice wrt PCX initialization.

Sure. Well, if you get time to refine the code, and you want automatic
initialization, I would advise sticking it in a DLL and make use of DLLMain
on Windows, or put it in a shared lib and use GCC
__attribute__((constructor/destructor)) extensions on other platforms to
initialize the TSD key. However, on other systems, your not going to get
automatic TLS initialization like DLLMain provides; No problem, because you
can do something like:

pcx_thread* pcx_thread_self() {
pcx_thread* self = pthread_getspecific(...);
if (! self) {
self = new pcx_thread;
pthread_setspecific(..., self);
}
return self;
}

And use the callback provided to `pthread_key_create()' to dtor the
pcx_thread objects. vZOOM uses this approach and it works well.

> Btw, do you run the code? Don't you notice any crashes? I suspect some
> subtle error in this version of proxy-collector. Although I am not
> sure. I don't have time to kick $hit from it with Relacy.

:^D

I have only ran it a couple of times. Well, it has not crashed, but then
again the environments in which I ran it were only sufficient for me to step
through the code and get a feel for what's going on. I have non bombarded it
with concurrency yet. Also, if you think that there is problem in your
proxy-collector, well, you could always use the following for piece of mind:

http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html

This algorithm has been blasted by Relacy, and it works (if you omit the
pc_defer function that is... ;^). IIRC, you provide my proxy algorithm in
your examples folder.

> Incidentally, I am thinking about creating a set of standard Relacy
> tests for some standard data-structures, like PDR, producer-consumer
> queue etc.
> I.e. for example set of tests for PDR which is distributed with
> Relacy. You just parametrize the set with your implementation of PDR
> and run the set. No need to invent and implement your own tests for
> PDR. Something like:
> #include "my_pdr_implementation.h"
> int main() {
> rl::standard_test_set_for_pdr<my_pdr_implementation>::run();
> }
> Unfortunately I don't have time to implement it right now.

That would be very convenient indeed!

Chris M. Thomasson

unread,

Oct 20, 2008, 12:09:50 AM10/20/08

to

IMVHO, a very nice way to measure real throughput and scalability is to
determine how many operations can happen on a per-thread-per-second/minute
basis. Think in terms of how many reads/searches can a reader thread perform
in a second/minute. Also, how many writes can a writer thread perform in a
second/minute. You will get interesting results indeed.

demo

unread,

Oct 20, 2008, 2:56:44 AM10/20/08

to

On Oct 17, 9:45 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
> That might do it... Give it a whirl.
>

I'll give it a spin. Thanks.

--
Tore Halvorsen

demo

unread,

Oct 20, 2008, 3:04:41 AM10/20/08

to

On Oct 18, 12:46 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Sorry for such stupid things!
>
> Manual initialization as proposed by Chris must fix the problem.
> The code must NOT be considered as production-ready in any case.

I don't have that much expectations from code posted on usenet ;)

> I was not able to port/test it on various compilers/platforms.
> Actually I was able to spend on it only around a day.
>
> But I hope you will be able to make it work.

I'll try Chris' changes and see if I can get it up and running.

--
Tore Halvorsen

demo

unread,

Oct 20, 2008, 3:28:50 AM10/20/08

to

On Oct 17, 9:45 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Yes, it should be initialized. AFAICT, he is using a TLS hack in
> order to get automatically notified of thread
> creating/destruction:

> [...]

> This simply does not work on some compilers. For instance, I
> can't seem get my version of the hack to work on MSVC 6 through
> 8. However, I think you can just manually call this; try

> something like: [...]

Looks like the TLS hack works for threads other than main. Starting
the code in a thread doesn't have any problems.

--
Tore Halvorsen

Dmitriy V'jukov

unread,

Oct 20, 2008, 3:50:22 AM10/20/08

to

This is exactly how I measured performance:
http://groups.google.ru/group/comp.programming.threads/msg/0f2be213000e178f
I like this variant too.

Dmitriy V'jukov