transforming single-threaded allocators...

Chris M. Thomasson

unread,

Oct 8, 2008, 12:32:07 AM10/8/08

to

I have developed a non-blocking algorithm that allows one to transform an
inherently single-threaded memory allocator into a scaleable multi-threaded
allocator. This invention allows a programmer to just create an allocator
algorithm without worrying about any multi-threading issues. Once your done
with your single-threaded allocator, well, when you apply my non-blocking
algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is that
cool or what?

Okay, here is an initial proof of concept, I have transformed the Windows
Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the
internal lock used by the heap, which makes it a heck of a lot more
efficient. However, since the heap is no longer serialized, one cannot use
it in a multi-threaded application as a general allocator; until now... The
algorithm I created is fairly simple; basically, it goes like this:
___________________________________________________________________________
I create thread local user allocator in every thread.

Alloc request is forwarded to this thread-local user allocator directly.

If free request goes from thread that allocate block, then free request is
forwarded to this thread-local user allocator.

If free request goes from another thread, then you accumulate this block in
per-thread stack-based freelist.

Blocks from this freelist are actually freed in batches when thread
allocates/deallocates another block.
___________________________________________________________________________

I mentioned this algorithm a while back:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

Well, I have released a pre-alpha version of it which demonstrates how to
transform the Windows Heap object under GPL license; the example source-code
can be found here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Can you please run the test as-is, which uses stock malloc impl, and record
the output... Then uncomment the following line:

#define USE_MEM_ALLOC

which engages my algorithm; run the test again; and post the differences in
timings between the two. I think you will be fairly surprised at the
performance enhancement. The lock in the Windows Heap object that `malloc'
is based on severely damages its scalability when used in a multi-threaded
environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed
and the heaps are created on a per-thread basis. This distribution and
clever support algorithm I invented allows for the Windows Heap to scale
quite nicely indeed.

I am going to tweak the algorithm such that one could plug in different
allocators on a per-thread basis! In other words, thread A can use a
different allocation algorithm than thread B, and so on. My algorithm
actually supports this. I think the possibilities are truly amazing. It even
allows thread A to allocate and free in thread B even though they use
completely different allocation algorithm. That's pretty darn cool. Some may
think its even impossible; its not!

BTW, I am thinking that this algorithm might be able to be used to transform
other single-threaded things besides allocators... Humm, need to think some
more on that.

Chris M. Thomasson

unread,

Oct 8, 2008, 12:36:24 AM10/8/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ACWGk.2030$982...@newsfe02.iad...

>I have developed a non-blocking algorithm that allows one to transform an
>inherently single-threaded memory allocator into a scaleable multi-threaded
>allocator. This invention allows a programmer to just create an allocator
>algorithm without worrying about any multi-threading issues. Once your done
>with your single-threaded allocator, well, when you apply my non-blocking
>algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is
>that cool or what?

[...]

> I mentioned this algorithm a while back:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
>
>
> Well, I have released a pre-alpha version of it which demonstrates how to
> transform the Windows Heap object under GPL license; the example
> source-code can be found here:
>
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html
>
>
> Can you please run the test as-is

The test portion of the code uses the pthread-win32 library which can be
downloaded from here:

http://sourceware.org/pthreads-win32

I am thinking of removing dependence on it and just use pure Windows
primitives.

[...]

Chris M. Thomasson

unread,

Oct 8, 2008, 1:26:09 AM10/8/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ACWGk.2030$982...@newsfe02.iad...

>I have developed a non-blocking algorithm that allows one to transform an
>inherently single-threaded memory allocator into a scaleable multi-threaded
>allocator. This invention allows a programmer to just create an allocator
>algorithm without worrying about any multi-threading issues. Once your done
>with your single-threaded allocator, well, when you apply my non-blocking
>algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is
>that cool or what?

[...]

> Well, I have released a pre-alpha version of it which demonstrates how to
> transform the Windows Heap object under GPL license; the example
> source-code can be found here:
>
> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

I posted the WRONG VERSION! I did not include the reclaim threshold
experimental feature I am playing around with. Also, there was a bug in the
following function `mem_thread_sys_reclaim()'. Here was the original code:

LONG
mem_thread_sys_reclaim(
union mem_block* const reset
) {
LONG count = 0;
union mem_block* mblk = XCHG(&g_mem_tls->rfree, reset);
while (mblk) {
union mem_block* const next = mblk->next;
if (! HeapFree(g_mem_tls->heap, HEAP_NO_SERIALIZE, mblk)) {
assert(0); abort();
}
++count;
mblk = next;
}
if (! reset) {
g_mem_tls->refs += count;
}
return count;
}

Can you spot the bug? Well, IMVHO, its kind of easy... See, the reclaim
function is suppose to free the blocks that have been deferred by remote
threads... Well, a free is suppose to actually decrement the owner thread
reference count... I was incrementing it! OUCH!

;^(...

Anyway, its fixed on the web-site. I needed to change the code:

if (! reset) {
g_mem_tls->refs += count;
}

to:

if (! reset) {
g_mem_tls->refs -= count;
}

So, you should re-download the source-code! Sorry for any trouble.

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Anyway, I am not sure how the reclaim threshold feature is going to effect
performance. Humm... Need to think about this.

Chris M. Thomasson

unread,

Oct 8, 2008, 1:30:07 AM10/8/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:fpXGk.2265$982....@newsfe02.iad...

> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:ACWGk.2030$982...@newsfe02.iad...
>>I have developed a non-blocking algorithm that allows one to transform an
>>inherently single-threaded memory allocator into a scaleable
>>multi-threaded allocator. This invention allows a programmer to just
>>create an allocator algorithm without worrying about any multi-threading
>>issues. Once your done with your single-threaded allocator, well, when you
>>apply my non-blocking algorithm, and PRESTO! Its now a scaleable
>>multi-threaded allocator! Is that cool or what?
> [...]
>> Well, I have released a pre-alpha version of it which demonstrates how to
>> transform the Windows Heap object under GPL license; the example
>> source-code can be found here:
>>
>> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html
>
> I posted the WRONG VERSION!

BTW, in case anybody is interested, here is the first time I made the base
non-blocking algorithm itself public:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69

Chris M. Thomasson

unread,

Oct 8, 2008, 1:47:43 AM10/8/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:fpXGk.2265$982....@newsfe02.iad...

> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:ACWGk.2030$982...@newsfe02.iad...
>>I have developed a non-blocking algorithm that allows one to transform an
>>inherently single-threaded memory allocator into a scaleable
>>multi-threaded allocator. This invention allows a programmer to just
>>create an allocator algorithm without worrying about any multi-threading
>>issues. Once your done with your single-threaded allocator, well, when you
>>apply my non-blocking algorithm, and PRESTO! Its now a scaleable
>>multi-threaded allocator! Is that cool or what?
> [...]
>> Well, I have released a pre-alpha version of it which demonstrates how to
>> transform the Windows Heap object under GPL license; the example
>> source-code can be found here:
>>
>> http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html
>
> I posted the WRONG VERSION! I did not include the reclaim threshold
> experimental feature I am playing around with. Also, there was a bug in
> the following function `mem_thread_sys_reclaim()'. Here was the original
> code:

I improved the test application; the web-page is updated. See, before, it
was actually calling `my_malloc/free' functions under the protection of the
test mutex. Well, that's no good. So, I moved them outside of the
critical-region. Now, it's would be an interesting experiment to run both
versions of the test. One with the allocation functions in the
critical-section, and one without. On my end, my algorithm beats normal
malloc/free in either occasion, but really whips them when they are not
protected by the mutex. Can you please post some numbers? I am curious to
see how horrible stock malloc/free actually are!

;^D

Chris M. Thomasson

unread,

Oct 8, 2008, 5:44:19 AM10/8/08

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:ACWGk.2030$982...@newsfe02.iad...

>I have developed a non-blocking algorithm that allows one to transform an
>inherently single-threaded memory allocator into a scaleable multi-threaded
>allocator. This invention allows a programmer to just create an allocator
>algorithm without worrying about any multi-threading issues. Once your done
>with your single-threaded allocator, well, when you apply my non-blocking
>algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is
>that cool or what?

[...]

The algorithm works, however, I don't know how useful it is. I would just
use low-frag heap on Windows instead of this. I am not seeing a real benefit
from using per-thread Windows Heaps. I basically wanted to prove that it can
be done. There are several caveats that come along with the design because
basically it needs to call into the base allocator interface, Windows Heap
API in this case, for every alloc/free. In other words, it currently does
not attempt to fast-path it. Apparently, the Heap API is fairly expensive
even with the `HEAP_NO_SERIALIZE' flag set. I need to create the plug-in
version and see if any other allocator designs would work better.

Still, through, I like the idea of being able to transform single-threaded
allocator into multi-threaded one. It has to provide some scalability
enhancements on multi-cpu systems. It just seems like it would be a given.

I am displeased with some of the numbers I am getting. For instance, I
simply cannot get it to come anywhere near the performance of the following
allocator:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

I probably need to redo the test application and redo the timing code which
in totally imprecise to say the least.

So, final word... Stay tuned. Sorry to waste your time. Well, the algorithm
is definitely interesting to say the least...

Any thoughts?