For example, I have a multi-threaded server. Some "objects" are
created, passed through producer-consumer queues, stored in shared
locations, copied from shared locations, and so on. All objects are
independent. I have to answer a simple question: when is it safe to
delete a particular object?
The straightforward answer is plain old reference counting with
atomic_inc()/atomic_dec(): when the reference count drops to zero,
it's safe to delete the object.
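For concreteness, a minimal sketch of that baseline in C11 (the
helper names are mine):

#include <stdatomic.h>

struct object {
    atomic_int rc;
    void (*fp_dtor) (void*);
};

void object_acquire(struct object* obj) {
    atomic_fetch_add_explicit(&obj->rc, 1, memory_order_relaxed);
}

void object_release(struct object* obj) {
    /* release on the decrement, acquire before running the dtor */
    if (atomic_fetch_sub_explicit(&obj->rc, 1, memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        obj->fp_dtor(obj);
    }
}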
I see two alternatives. First: reference counting based on message
passing. I.e. the object has an owner thread associated with it, and
when a thread wants to acquire/release a reference to the object it
passes an appropriate message to the owner thread. There are some
problems here: the message passing must be very efficient, and the
messages have to be processed in order (I would have to use something
like Lamport timestamps).
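For concreteness, a hypothetical shape of such a request message (the
field names are mine, and the Lamport timestamp is what would order
requests coming from different threads):

#include <stdint.h>

struct object;

struct rc_msg {
    struct rc_msg* next;       /* intrusive link for the owner's queue */
    struct object* obj;        /* target of the request */
    uint64_t       lamport_ts; /* sender's Lamport clock at send time */
    int            delta;      /* +1 = acquire, -1 = release */
};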
Second: a GC-like algorithm. I.e. all objects are stored in some
global (or per-thread) list, and this list is periodically scanned;
there must be some means of detecting when an object is no longer
used by any thread. In this solution, when a thread wants to
acquire/release a reference to an object, it can just modify some
thread-local variable. But it's obvious that the scanning will be
expensive.
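A hypothetical sketch of the thread-local marking part of such a
scheme (real designs like hazard pointers need multiple slots per
thread and a careful re-check protocol; all names here are mine):

#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 64

struct object;

/* one single-writer slot per thread: the object it currently uses */
static struct object* _Atomic g_in_use[MAX_THREADS];

void thread_mark(int tid, struct object* obj) {
    atomic_store(&g_in_use[tid], obj);
    /* a real scheme must re-check here that obj is still reachable */
}

void thread_unmark(int tid) {
    atomic_store_explicit(&g_in_use[tid], NULL, memory_order_release);
}

/* the scanner: an otherwise-unreachable object can be reclaimed
   only if no thread has it marked */
int object_is_used(struct object* obj) {
    for (int i = 0; i != MAX_THREADS; ++i)
        if (atomic_load(&g_in_use[i]) == obj)
            return 1;
    return 0;
}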
I'm looking for a more scalable and cheaper solution than plain old
reference counting. Can somebody point me to other alternatives? Or
provide some comments on the alternatives I describe?
Dmitriy V'jukov
Boris
> Reference-counting can pose problems, for example: circular references.
> Garbage-collector code should be able to detect (handle) such cases.
> Otherwise, the application should handle such cases: by using 'weak'
> references when appropriate.
I understand that perfectly well. As a C++ developer I have been
living with it for years, and it completely satisfies me.
Dmitriy V'jukov
You can try something like this:
http://groups.google.com/group/comp.programming.threads/msg/ed54fb622f292197
I don't think it will work for strong thread-safety at all... I stuck
some logic in there to try and make it work; I think it's nonsense.
Here is a pseudo-code sketch of the algorithm with all of that crap
taken out:
_________________________________________________________________
struct object {
    int rc;
    void (*fp_dtor) (void*);
};

struct rc_req {
    rc_req* next;
    object* obj;
    per_thread* owner;
};

struct per_thread {
    /* links for registered thread list */
    per_thread* next;
    per_thread* prev;
    vz_spsc_queue rc_cache; /* holds cached rc_req objects */
    vz_spsc_queue rc_inc;   /* holds increment requests */
    vz_spsc_queue rc_dec;   /* holds decrement requests */
};

struct collector {
    /* list of registered threads */
    per_thread* head;
    per_thread* tail;
    pthread_mutex_t mtx; /* thread list lock */
    /* gathered inc/decrement requests */
    rc_req* rc_inc;
    rc_req* rc_dec;
};

/* request an increment */
void object_inc(object* const _this) {
    per_thread* const _this = pthread_getspecific(...);
    rc_req* const rc = vz_spsc_queue_pop(&_this->rc_cache);
    rc->obj = _this;
    rc->owner = _this;
    vz_spsc_queue_push_nosig(&_this->rc_inc, rc);
}

/* request a decrement */
void object_dec(object* const _this) {
    per_thread* const _this = pthread_getspecific(...);
    rc_req* const rc = vz_spsc_queue_pop(&_this->rc_cache);
    rc->obj = _this;
    rc->owner = _this;
    vz_spsc_queue_push_nosig(&_this->rc_dec, rc);
}

/* the reference mutator thread */
void* collector_thread(void* state) {
    collector* const _this = state;
    per_thread* thread;
    rc_req* rc;
    for (;;) {
        wait_for_gc_interval(...);
        /* gather up all of the counter requests */
        pthread_mutex_lock(&_this->mtx);
        thread = _this->head;
        while (thread) {
            /* grab increments */
            rc = vz_spsc_queue_trypop(&thread->rc_inc);
            while (rc) {
                rc->next = _this->rc_inc;
                _this->rc_inc = rc;
                rc = vz_spsc_queue_trypop(&thread->rc_inc);
            }
            /* grab decrements */
            rc = vz_spsc_queue_trypop(&thread->rc_dec);
            while (rc) {
                rc->next = _this->rc_dec;
                _this->rc_dec = rc;
                rc = vz_spsc_queue_trypop(&thread->rc_dec);
            }
            thread = thread->next;
        }
        pthread_mutex_unlock(&_this->mtx);
        /* process all the increments _first_ */
        rc = _this->rc_inc;
        while (rc) {
            _this->rc_inc = rc->next;
            ++rc->obj->rc;
            /* hand the rc_req back to its owner thread for reuse */
            vz_spsc_queue_push(&rc->owner->rc_cache, rc);
            rc = _this->rc_inc;
        }
        /* process all the decrements _last_ */
        rc = _this->rc_dec;
        while (rc) {
            _this->rc_dec = rc->next;
            --rc->obj->rc;
            if (! rc->obj->rc) {
                /* count hit zero: time to call the object's dtor! */
                rc->obj->fp_dtor(rc->obj);
            }
            /* hand the rc_req back to its owner thread for reuse */
            vz_spsc_queue_push(&rc->owner->rc_cache, rc);
            rc = _this->rc_dec;
        }
    }
}
_________________________________________________________________
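For illustration, a hypothetical usage pattern on top of that sketch
(the queue type and functions here are made up):

/* producer: request +1 before publishing the object */
void producer(object* obj, some_queue* q) {
    object_inc(obj);
    queue_push(q, obj);
}

/* consumer: request -1 once it is done with the object */
void consumer(some_queue* q) {
    object* obj = queue_pop(q);
    /* ... use obj ... */
    object_dec(obj);
}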
That should allow for low-overhead distributed reference counting
with basic thread-safety...
What do ya think?
I think that you call both the object* and the per_thread* variables
'_this' in the object_inc()/object_dec() functions... :)
Also, I think the user thread must have 2 rc_dec queues: 1 for the
'current' epoch and 1 for the 'next' epoch, and switch them when it
notices an epoch shift. The user thread always enqueues requests to
the 'next' queue, and the collector thread always processes the
'current' epoch queues. Otherwise the collector thread can process a
decrement request before the corresponding acquire request, if those
requests come from different threads.
Or, optionally, the collector thread can use the
OBJECT_FLAG_DTOR/dtor_final technique.
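A hypothetical sketch of those two-epoch decrement queues, reusing
the naming from the code above (how the epoch gets published is my
own guess):

struct per_thread {
    per_thread* next;
    per_thread* prev;
    vz_spsc_queue rc_cache;
    vz_spsc_queue rc_inc;
    vz_spsc_queue rc_dec[2]; /* ['current' epoch] and ['next' epoch] */
};

extern unsigned g_epoch; /* bumped and published by the collector */

void object_dec(object* const obj) {
    per_thread* const self = pthread_getspecific(...);
    rc_req* const rc = vz_spsc_queue_pop(&self->rc_cache);
    rc->obj = obj;
    rc->owner = self;
    /* always enqueue into the 'next' epoch's queue */
    vz_spsc_queue_push_nosig(&self->rc_dec[(g_epoch + 1) & 1], rc);
}

/* the collector bumps g_epoch, waits until every thread has noticed
   the shift, and only then drains rc_dec[g_epoch & 1] -- so all the
   increments gathered for that epoch get applied before any of its
   decrements */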
Dmitriy V'jukov
> That should allow for low-overhead distributed reference counting
> with basic thread-safety...
>
> What do ya think?
I considered this variant, and I concluded that it will not play very
well in my environment.
Plain old reference counting: every acquire/release operation == one
cache line transfer + one atomic RMW.
Given intrusive reference counting, every acquire/release operation
modifies the cache line where the object itself resides, which causes
additional cache line transfers.
Distributed reference counting based on message passing: every
acquire/release operation == a cache line transfer to move the
acquire/release request + a cache line transfer to move the rc_req
object back + 2 enqueue operations + 2 dequeue operations + processing
in the collector thread + the cache line where the user object resides
is still constantly modified, causing additional cache line transfers.
Do you think it will play better?
And it is less prompt than plain old reference counting: objects are
only reclaimed at the collector's intervals.
And it requires additional memory for the rc_req objects.
I can't say that this is a superior solution... What do you think?
And I see an additional problem here: the collector thread can simply
be saturated by requests from all of the other user threads. However,
this can be solved by creating a collector thread per processor and
partitioning the user objects among them.
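A hypothetical sketch of that partitioning (the constants and the
hash are arbitrary):

#include <stddef.h>

#define N_COLLECTORS 4

static collector g_collectors[N_COLLECTORS];

/* route every request for a given object to the same collector,
   so all of its counter updates stay on one thread */
static collector* collector_for(object* const obj) {
    size_t h = (size_t)obj >> 6; /* drop the cache-line offset bits */
    return &g_collectors[h % N_COLLECTORS];
}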
Concerning my environment: I have an actor-based framework. I.e. it's
heavily based on message passing, and every message is sent to a
particular actor (a C++ object). So before sending a message I have to
acquire references to the message object and to the actor object, and
after processing the message I have to release references to the
message object and to the actor object. So that's 4 additional
messages for every useful user message. It would basically blow up the
entire system.
Dmitriy V'jukov
> Concerning my environment: I have an actor-based framework. I.e. it's
> heavily based on message passing, and every message is sent to a
> particular actor (a C++ object). So before sending a message I have to
> acquire references to the message object and to the actor object, and
> after processing the message I have to release references to the
> message object and to the actor object. So that's 4 additional
> messages for every useful user message. It would basically blow up the
> entire system.
Now I'm working on a solution which is *not* based on message
passing. It must gracefully handle zillions of acquire/release
operations :)
Dmitriy V'jukov
> Concerning my environment: I have an actor-based framework. I.e. it's
> heavily based on message passing, and every message is sent to a
> particular actor (a C++ object). So before sending a message I have to
> acquire references to the message object and to the actor object, and
> after processing the message I have to release references to the
> message object and to the actor object. So that's 4 additional
> messages for every useful user message. It would basically blow up the
> entire system.
I forgot that the rc_req objects must be transferred back to the user
thread. So it will be *8* additional helper messages for every single
user message (each of the 4 requests also needs its rc_req carried
back).
Dmitriy V'jukov
I have almost completed the implementation. It will handle:
1. Reference counting with basic thread-safety
2. Reference counting with strong thread-safety
3. Reference counting with cross-thread references
4. Ultra low overhead PDR
5. PDR with long-term cross-thread references
I.e. basically any usage pattern.
I hope to post the implementation in the near future.
Dmitriy V'jukov
So do I, as I am interested to see how it completely differs from the
very basic vZOOM distributed reference counting scheme...
:^)