
How do you create an efficient _and_ scalable multi-threaded allocator?

Chris Thomasson

Oct 21, 2007, 5:20:23 AM

[comp.lang.c++ added because it's going to standardize fine-grained, very
low-level multi-threading primitives; refer to the cpp-threads mailing
list...]


I decoupled the allocation logic from the synchronization mechanism such
that single-threaded allocators can now be used in full-blown multi-threaded
environments. IMHO, it's a really neat feature. You could even set things up
where thread A uses a different allocator than thread B, yet they are still
able to allocate, pass around, and free blocks between themselves. It's an
extremely flexible framework.

This can be an important invention. Here is the basic sales pitch:

Question: How do you create a multi-threaded allocator, without creating a
multi-threaded allocator?

Answer: Create a single-threaded allocator and tell the vZOOM library to
multi-thread it for you...

Wow! :^)

Here is the "basic" vZOOM allocation scheme:

________________________
- create a thread-local instance of a user-defined single-threaded
allocator in every thread (e.g., a Windows heap created with
HEAP_NO_SERIALIZE).

- allocation requests are forwarded to this thread-local user allocator
directly.

- if a free request comes from the thread that allocated the block (i.e., the
origin thread), then the request is forwarded to this thread-local user
allocator.

- if a free request comes from another thread, then the block is accumulated
in a per-thread, stack-based freelist "belonging to the origin thread", using
a single atomic CAS.

- blocks on this freelist are actually reused/freed in batches using a
single atomic SWAP when the thread allocates/deallocates another block. For
instance, a thread that fails to allocate from its thread-local heap will do
a SWAP on the freelist and try to fulfill the allocation request from there
(both paths are sketched right after this outline).
________________________
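
Just to illustrate those last two points, here is a minimal sketch of the
remote-free and batch-reclaim paths in plain C, with GCC-style __atomic
builtins standing in for the CAS/SWAP primitives; the per_thread, block,
remote_free and reclaim_all names are made up for this sketch and are not
the actual vZOOM API:

__________________________________________
#include <stddef.h>

/* intrusive freelist node; a freed block is reused as its own link */
typedef struct block { struct block* next; } block;

typedef struct per_thread {
    block* freelist; /* lock-free LIFO of blocks freed by *other* threads */
    /* ... the user-defined single-threaded allocator state lives here ... */
} per_thread;

/* remote free: another thread pushes the block onto the origin thread's
   freelist with a single CAS (a classic lock-free stack push) */
static void remote_free(per_thread* origin, block* blk) {
    block* head = __atomic_load_n(&origin->freelist, __ATOMIC_RELAXED);
    do {
        blk->next = head;
    } while (! __atomic_compare_exchange_n(&origin->freelist, &head, blk, 1,
                                           __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}

/* batch reclaim: the origin thread grabs the whole list with a single SWAP
   and can then recycle the blocks through its single-threaded allocator */
static block* reclaim_all(per_thread* self) {
    return __atomic_exchange_n(&self->freelist, NULL, __ATOMIC_ACQUIRE);
}
__________________________________________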

Any thoughts/comments/suggestions/rants/raves?

;^)

Bill Todd

Oct 21, 2007, 9:44:11 AM

Describing how blocks eventually get deallocated from the free lists
back to the originating thread's heap (without requiring interlocks that
defeat your goal of not having them on such heaps) so that those heaps
don't become so fragmented that they become unusable might be nice. So
might a simulation to demonstrate that the free lists themselves don't
create increasingly unusable (effectively, fragmented) storage for their
temporary owners (surely you're not assuming that all block requests are
equal in size).

- bill

Chris Thomasson

Oct 22, 2007, 7:30:16 AM

"Bill Todd" <bill...@metrocast.net> wrote in message
news:HcWdnZMPHa6BxYba...@metrocastcablevision.com...

> Chris Thomasson wrote:
>> [comp.lang.c++ added because it's going to standardize fine-grained, very
>> low-level multi-threading primitives; refer to the cpp-threads mailing
>> list...]
>>
>>
>> I decoupled the allocation logic from the synchronization mechanism such
>> that single-threaded allocators can now be used in full-blown
>> multi-threaded
>> environments.
[...]

>> Here is the "basic" vZOOM allocation scheme:
>>
>> ________________________
[...]

>> ________________________
>>
>>
>>
>> Any thoughts/comments/suggestions/rants/raves?
>
> Describing how blocks eventually get deallocated from the free lists back
> to the originating thread's heap (without requiring interlocks that defeat
> your goal of not having them on such heaps) so that those heaps don't
> become so fragmented that they become unusable might be nice. So might a
> simulation to demonstrate that the free lists themselves don't create
> increasingly unusable (effectively, fragmented) storage for their
> temporary owners (surely you're not assuming that all block requests are
> equal in size).

I use the per-thread, user-provided allocator to allocate a simple
per-thread segregated array of slabs. Something like:


thread::slab[0].sz = sizeof(void*);
thread::slab[1].sz = thread::slab[0].sz * 2;
thread::slab[2].sz = thread::slab[1].sz * 2;
thread::slab[3].sz = thread::slab[2].sz * 2;
thread::slab[4].sz = thread::slab[3].sz * 2;
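
Purely as an illustrative sketch, a request size could be mapped onto these
doubling size classes with something like the following (the slab_index name
is made up, not part of vZOOM):

__________________________________________
#include <stddef.h>

/* map a request size onto the doubling size classes above: class 0 serves
   sizeof(void*), class 1 serves 2 * sizeof(void*), and so on */
static size_t slab_index(size_t sz) {
    size_t idx = 0;
    size_t cap = sizeof(void*);
    while (cap < sz) { /* walk up the doubling classes */
        cap *= 2;
        ++idx;
    }
    return idx; /* caller must check idx against the number of slabs */
}
__________________________________________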


Each slab is aligned on a common boundary value such that you can round the
address of one of its blocks down to that boundary in order to get at its
anchor/header. Each slab anchor holds a freelist, a thread-id, and the blocks
of memory.
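
Here is a rough sketch of that rounding trick; SLAB_ALIGN and the exact
slab_anchor layout are hypothetical stand-ins, assuming the common boundary
is a power of two:

__________________________________________
#include <stddef.h>
#include <stdint.h>

#define SLAB_ALIGN ((uintptr_t)8192) /* hypothetical power-of-two boundary */

typedef struct slab_block { struct slab_block* next; } slab_block;

typedef struct slab_anchor {
    slab_block* freelist;    /* lock-free LIFO for frees from other threads */
    slab_block* local_list;  /* private list, touched only by the owner */
    unsigned long owner_tid; /* id of the origin thread */
    size_t sz;               /* block size this slab serves */
    /* the slab's blocks of memory follow the anchor */
} slab_anchor;

/* round a block's address down to the slab boundary to recover its anchor */
static slab_anchor* anchor_of(void* blk) {
    return (slab_anchor*)((uintptr_t)blk & ~(SLAB_ALIGN - 1));
}
__________________________________________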


- When a thread deallocates a block, it rounds the address down to get at the
slab anchor, compares the anchor's thread-id with its own, and pushes the
block onto the slab's freelist via CAS if they do not match.


- A thread allocates by locating the right slab based on the requested size
and trying to get a block from that slab; if it's empty, it does a SWAP on
the slab's freelist and tries to get a block from there. If that fails, it
then tries to allocate directly from the user allocator.
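
Continuing the sketch above (same hypothetical slab_anchor, anchor_of and
slab_index names, plus a made-up per-thread descriptor), the two paths would
look roughly like this:

__________________________________________
/* hypothetical per-thread descriptor; fp_user_defined_malloc/free are the
   user-supplied single-threaded allocator mentioned earlier */
typedef struct thread_desc {
    unsigned long tid;
    void* (*fp_user_defined_malloc)(size_t);
    void  (*fp_user_defined_free)(void*);
    slab_anchor* slab[5]; /* the segregated size classes shown above */
} thread_desc;

static void slab_free(thread_desc* self, void* mem) {
    slab_anchor* a = anchor_of(mem); /* round down to the anchor */
    slab_block* blk = (slab_block*)mem;
    if (a->owner_tid == self->tid) {
        blk->next = a->local_list;   /* origin thread: no atomics needed */
        a->local_list = blk;
    } else {
        /* foreign thread: single-CAS push onto the slab's lock-free freelist */
        slab_block* head = __atomic_load_n(&a->freelist, __ATOMIC_RELAXED);
        do {
            blk->next = head;
        } while (! __atomic_compare_exchange_n(&a->freelist, &head, blk, 1,
                                               __ATOMIC_RELEASE, __ATOMIC_RELAXED));
    }
}

static void* slab_malloc(thread_desc* self, size_t sz) {
    /* pick the size class (assumes sz fits one of the classes) */
    slab_anchor* a = self->slab[slab_index(sz)];
    if (! a->local_list) /* local list empty: SWAP in the remote frees */
        a->local_list = __atomic_exchange_n(&a->freelist, NULL, __ATOMIC_ACQUIRE);
    if (a->local_list) {
        slab_block* blk = a->local_list; /* pop from the private list */
        a->local_list = blk->next;
        return blk;
    }
    return self->fp_user_defined_malloc(sz); /* last resort: user allocator */
}
__________________________________________

The only atomic in the common case is the CAS taken by foreign frees; the
owner thread touches its own lists without any synchronization, which is the
whole point of the scheme.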

Chris Thomasson

Oct 22, 2007, 10:23:36 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:rLGdnersrM0DFYHa...@comcast.com...

> "Bill Todd" <bill...@metrocast.net> wrote in message
> news:HcWdnZMPHa6BxYba...@metrocastcablevision.com...
[...]

>>> Any thoughts/comments/suggestions/rants/raves?
>>
>> Describing how blocks eventually get deallocated from the free lists back
>> to the originating thread's heap (without requiring interlocks that
>> defeat your goal of not having them on such heaps) so that those heaps
>> don't become so fragmented that they become unusable might be nice. So
>> might a simulation to demonstrate that the free lists themselves don't
>> create increasingly unusable (effectively, fragmented) storage for their
>> temporary owners (surely you're not assuming that all block requests are
>> equal in size).
>
> I use the per-thread, user-provided allocator to allocate a simple
> per-thread segregated array of slabs. Something like:
>
[...]

The user can ask vZOOM to bypass the segregated slabs and forward requests
to the user-defined allocator directly.


In that case, blocks are deallocated from the freelist directly back to the
user-defined allocator. Something like:

__________________________________________
void* malloc(size_t sz) {
    /* grab this thread's descriptor (TLS key elided in the original post) */
    per_thread* const _this = pthread_getspecific(...);
    void* buf = _this->fp_user_defined_malloc(sz);
    if (! buf) {
        /* allocation failed: SWAP out the whole remote-free list in one shot
           and hand every block back to the user-defined allocator */
        block* blk = SWAP(&_this->freelist, 0);
        while (blk) {
            block* const next = blk->next;
            _this->fp_user_defined_free(blk);
            blk = next;
        }
        /* retry now that the freed blocks are back in the user heap */
        buf = _this->fp_user_defined_malloc(sz);
    }
    return buf;
}
__________________________________________

Does that answer your question?
