http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0b3166db3c5518d
I have a full-blown framework that can transform your single-threaded memory
allocator implementations into 100% lock-free multi-threaded allocators.
I know this sounds too good to be true, but if you are interested, I can
provide you with the solution.
Any questions/comments?
--
Chris M. Thomasson
http://appcore.home.comcast.net
> I have a full-blown framework that can transform your single-threaded memory
> allocator implementations into 100% lock-free multi-threaded allocators.
> Any questions/comments?
Let me guess :)
You create a thread-local user allocator in every thread.
An alloc request is forwarded directly to this thread-local user
allocator.
If a free request comes from the thread that allocated the block, the
request is forwarded to this thread-local user allocator.
If a free request comes from another thread, you accumulate the block in
a per-thread stack-based freelist.
Blocks from this freelist are actually freed in batches when [here there
can be variations] the thread allocates/deallocates another block.
Am I divine? :)
Dmitriy V'jukov
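The scheme Dmitriy guesses at can be sketched roughly like this. This is a minimal, hypothetical illustration (all names are invented, and size classes and the batch-drain policy are omitted), not Chris's actual code:

```cpp
#include <atomic>
#include <cstdlib>

// Each thread owns one heap: the owner allocates and frees with plain
// loads/stores, while foreign threads push freed blocks onto a lock-free
// "remote" stack that the owner drains in batches.
struct Block {
    Block* next;
};

class PerThreadHeap {
    Block* local_free_ = nullptr;              // owner-only: no atomics needed
    std::atomic<Block*> remote_free_{nullptr}; // frees arriving from other threads

public:
    // Owner thread only: reuse a cached block, draining the entire remote
    // stack with a single atomic exchange when the local list runs dry.
    void* alloc(std::size_t n) {
        if (!local_free_)
            local_free_ = remote_free_.exchange(nullptr, std::memory_order_acquire);
        if (local_free_) {
            Block* b = local_free_;
            local_free_ = b->next;
            return b;
        }
        return std::malloc(n < sizeof(Block) ? sizeof(Block) : n);
    }

    // Owner thread only: plain stores, exactly as in a single-threaded allocator.
    void local_free(void* p) {
        Block* b = static_cast<Block*>(p);
        b->next = local_free_;
        local_free_ = b;
    }

    // Any other thread: one lock-free CAS push per freed block.
    void remote_free(void* p) {
        Block* b = static_cast<Block*>(p);
        Block* head = remote_free_.load(std::memory_order_relaxed);
        do {
            b->next = head;
        } while (!remote_free_.compare_exchange_weak(
                     head, b, std::memory_order_release, std::memory_order_relaxed));
    }
};
```

The only atomic operations are the CAS push on a foreign free and one exchange when the owner reclaims the whole remote batch.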
Not divine, just smart!
:^)
And we can amortize the CAS operations we incur when a thread frees a
foreign block.
A thread can keep a thread-local list of lists of foreign blocks, i.e.
it can cache foreign blocks in TLS, partitioned by the owning thread's
ID, and then push them onto the foreign thread's freelist stack in a
batch.
This would be beneficial in patterns where one thread frees foreign
blocks permanently (producer-consumer), and it would not be bad in
other situations either.
Maybe you have already implemented this... :)
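The batch-push idea can be sketched as follows; the names (RemoteStack, ForeignCache) and the fixed batch size are invented for illustration:

```cpp
#include <atomic>
#include <unordered_map>

// A freeing thread first parks foreign blocks in a thread-local cache keyed
// by the owning heap, then splices each accumulated chain onto the owner's
// lock-free free-stack with a single CAS, amortizing it over the batch.
struct Block { Block* next; };

struct RemoteStack {
    std::atomic<Block*> head{nullptr};

    // Push an entire chain [first..last] with one CAS.
    void push_batch(Block* first, Block* last) {
        Block* h = head.load(std::memory_order_relaxed);
        do {
            last->next = h;
        } while (!head.compare_exchange_weak(h, first,
                     std::memory_order_release, std::memory_order_relaxed));
    }
};

class ForeignCache {
    struct Chain { Block* first = nullptr; Block* last = nullptr; int count = 0; };
    std::unordered_map<RemoteStack*, Chain> pending_; // keyed by owning heap

public:
    static constexpr int kBatch = 32;

    // Called on every foreign free; touches only thread-local state until
    // kBatch blocks for the same owner have accumulated.
    void free_foreign(RemoteStack* owner, Block* b) {
        Chain& c = pending_[owner];
        b->next = c.first;
        c.first = b;
        if (!c.last) c.last = b;
        if (++c.count == kBatch) {
            owner->push_batch(c.first, c.last); // one CAS for kBatch blocks
            c = Chain{};
        }
    }
};
```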
Or we can create N^2 spsc-queues (everyone-to-everyone) and pass
foreign blocks through them. A thread does not have to multiplex them;
it can just take a free block from the first queue it comes across.
And where the former scheme has to execute a few CAS operations when a
batch is pushed or popped, here we have no CAS operations at all! None!
There would be no CAS in the entire code! And we get a multithreaded,
multicore-friendly allocator w/o CAS! I just can't believe it!
Yippee-Yay!
Dmitriy V'jukov
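One such spsc-queue can be sketched with only atomic loads and stores; this is a hypothetical bounded ring buffer, not any specific implementation from the thread:

```cpp
#include <atomic>
#include <cstddef>

// Bounded single-producer/single-consumer ring buffer: no CAS anywhere.
// In the N^2 "everyone-to-everyone" layout above, each ordered pair of
// threads would own one of these.
template <std::size_t N>
class SpscQueue {
    void* buf_[N];
    std::atomic<std::size_t> head_{0}; // consumer position
    std::atomic<std::size_t> tail_{0}; // producer position

public:
    // Producer side: plain store of the slot, release-store of the index.
    bool push(void* p) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false; // full
        buf_[t] = p;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: acquire-load of the producer index, plain load of the slot.
    bool pop(void*& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The trade-off Dmitriy alludes to: no CAS at all, but N^2 queues of fixed capacity, i.e. a larger footprint.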
> Or we can create N^2 spsc-queues (everyone-to-everyone) and pass
> foreign blocks through them. A thread does not have to multiplex them;
> it can just take a free block from the first queue it comes across.
> And where the former scheme has to execute a few CAS operations when a
> batch is pushed or popped, here we have no CAS operations at all! None!
> There would be no CAS in the entire code! And we get a multithreaded,
> multicore-friendly allocator w/o CAS! I just can't believe it!
> Yippee-Yay!
Yippee-Yay one more time! :)
It would be interesting to compare the performance of a state-of-the-art
inherently multithreaded allocator against a state-of-the-art inherently
single-threaded allocator used with such a scheme. And I think the
latter would blow the former out of the water :)
For example, Maged M. Michael's "Scalable Lock-Free Dynamic Memory
Allocation" needs at least 2 CAS per allocation and at least 2 CAS per
deallocation - a total of 4 CAS per memory block.
But the latter would probably have a larger footprint...
Dmitriy V'jukov
Indeed, I have used various per-thread caching algorithms... However, in
real-world use, it's good to keep in mind that a user-implemented
producer-consumer algorithm is likely to have its own built-in caching
mechanism...
Yup. The state-of-the-art allocator pseudo-code I posted completely blows
Michael's algorithm out of the water. I would recommend testing against
Hoard; it uses some per-thread techniques... However, the pseudo-code
still beats it.
2 CAS for malloc/free is just as expensive as acquiring a non-contended
mutex. So, what's the difference in performance between any
2-CAS-per-function lock-free algorithm and an
uncontended-mutex-per-function algorithm? Not much!
:^0
Okay, here is the deal everybody:
You can create a simple, 100% single-threaded allocator; no locks, no
atomic ops, nothing. Plain loads and stores will work just fine. Assume
that only one thread will ever be using your allocator at any one time. Fine.
After you complete the implementation of your single-threaded allocator,
you can use a special algorithm I invented which allows you to transform
that single-threaded allocator of yours into a 100% lock-free, distributed
memory allocator; no kidding. Is that neat or what?
-- This is turning out to be a VERY powerful tool indeed... Wow! --
:^0
I used the invention to transform the single-threaded Windows Heap API:
http://msdn2.microsoft.com/en-us/library/aa366711.aspx
into a scalable multi-threaded allocator. You pass the HEAP_NO_SERIALIZE
flag to the HeapCreate function:
http://msdn2.microsoft.com/en-us/library/aa366599.aspx
in order to make it single-threaded.
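The Heap API usage described here looks roughly like this (Windows-only sketch, not compiled or tested here; error handling omitted):

```cpp
#include <windows.h>

// Create a heap with internal locking disabled; this is safe only because
// the surrounding per-thread framework guarantees single-threaded access
// to each heap.
HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0 /*initial size*/, 0 /*growable*/);
void*  p    = HeapAlloc(heap, 0, 64); // heap was created unserialized, so
                                      // no per-call flag is needed
HeapFree(heap, 0, p);
HeapDestroy(heap);
```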
The funny thing about this is... well, the transformed Windows heap beats
the pants off the multi-threaded heap created without the
HEAP_NO_SERIALIZE flag set...
lol.
I transformed the version of the malloc/free functions exported by the
single-threaded Windows C runtime DLL. The transformed single-threaded
version outperforms the version exported by the multi-threaded Windows
C runtime DLL...
Neat!
I basically looked at the source code and recreated it using a C++ class
so that I could create one allocator instance per thread.
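The class-per-thread arrangement can be sketched like this; SingleThreadedHeap is a hypothetical stand-in for the recreated CRT allocator, here just forwarding to malloc/free:

```cpp
#include <cstdlib>

// Wrap a single-threaded allocator in a class and give each thread its own
// instance via thread_local; all per-heap state stays private to one thread.
class SingleThreadedHeap {
    long allocs_ = 0; // private to one thread, so plain (non-atomic) is fine
public:
    void* alloc(std::size_t n) { ++allocs_; return std::malloc(n); }
    void  release(void* p)     { std::free(p); }
    long  alloc_count() const  { return allocs_; }
};

thread_local SingleThreadedHeap tls_heap; // one private heap per thread

void* my_malloc(std::size_t n) { return tls_heap.alloc(n); }
void  my_free(void* p)         { tls_heap.release(p); }
```

This only covers same-thread alloc/free; freeing a block on a different thread from the one that allocated it is where the remote-freelist machinery discussed earlier in the thread comes in.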
> I transformed the version of the malloc/free functions exported by the
> single-threaded windows c-runtime dll. The transformed single-threaded
> version outperforms the version exported by the multi-threaded windows
> c-runtime dll...
Can you provide some rough numbers for some benchmark? Maybe a synthetic
one.
Otherwise, you know, "outperforms" is very... vague...
Dmitriy V'jukov
They will be on the web site. Which should be up in about 40 days.
> They will be on the web site. Which should be up in about 40 days.
Cool! It will be interesting to see...
Dmitriy V'jukov