http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0b3166db3c5518d
I have a full-blown framework that can transform your single-threaded memory
allocator implementations into 100% lock-free multi-threaded allocators.
I know this sounds too good to be true, but if you are interested, I can
provide you with the solution.
Any questions/comments?
--
Chris M. Thomasson
http://appcore.home.comcast.net
> I have a full-blown framework that can transform your single-threaded memory
> allocator implementations into 100% lock-free multi-threaded allocators.
> Any questions/comments?
Let me guess :)
You create a thread-local user allocator in every thread.
An alloc request is forwarded directly to this thread-local user
allocator.
If a free request comes from the thread that allocated the block, the
request is forwarded to this thread-local user allocator.
If a free request comes from another thread, you accumulate the block in
a per-thread stack-based freelist.
Blocks from this freelist are actually freed in batches when [here there
can be variations] the thread allocates/deallocates another block.
Am I divine? :)
Dmitriy V'jukov
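The scheme Dmitriy guesses at can be sketched roughly like this. This is a minimal, hypothetical illustration (all names are invented, and size classes and the batch-drain policy are omitted), not Chris's actual code:

```cpp
#include <atomic>
#include <cstdlib>

// Each thread owns one heap: the owner allocates and frees with plain
// loads/stores, while foreign threads push freed blocks onto a lock-free
// "remote" stack that the owner drains in batches.
struct Block {
    Block* next;
};

class PerThreadHeap {
    Block* local_free_ = nullptr;              // owner-only: no atomics needed
    std::atomic<Block*> remote_free_{nullptr}; // frees arriving from other threads

public:
    // Owner thread only: reuse a cached block, draining the entire remote
    // stack with a single atomic exchange when the local list runs dry.
    void* alloc(std::size_t n) {
        if (!local_free_)
            local_free_ = remote_free_.exchange(nullptr, std::memory_order_acquire);
        if (local_free_) {
            Block* b = local_free_;
            local_free_ = b->next;
            return b;
        }
        return std::malloc(n < sizeof(Block) ? sizeof(Block) : n);
    }

    // Owner thread only: plain stores, exactly as in a single-threaded allocator.
    void local_free(void* p) {
        Block* b = static_cast<Block*>(p);
        b->next = local_free_;
        local_free_ = b;
    }

    // Any other thread: one lock-free CAS push per freed block.
    void remote_free(void* p) {
        Block* b = static_cast<Block*>(p);
        Block* head = remote_free_.load(std::memory_order_relaxed);
        do {
            b->next = head;
        } while (!remote_free_.compare_exchange_weak(
                     head, b, std::memory_order_release, std::memory_order_relaxed));
    }
};
```

The only atomic operations are the CAS push on a foreign free and one exchange when the owner reclaims the whole remote batch.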
Not divine, just smart!
:^)
And we can amortize the CAS operations we incur when a thread frees a
foreign block.
A thread can keep a thread-local list of lists of foreign blocks, i.e.
it can cache foreign blocks in TLS, partitioned by the owning thread's
ID, and then push them onto the foreign thread's freelist stack in a
batch.
This would be beneficial in patterns where one thread frees foreign
blocks permanently (producer-consumer), and it would not be bad in
other situations either.
Maybe you have already implemented this... :)
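The batch-push idea can be sketched as follows; the names (RemoteStack, ForeignCache) and the fixed batch size are invented for illustration:

```cpp
#include <atomic>
#include <unordered_map>

// A freeing thread first parks foreign blocks in a thread-local cache keyed
// by the owning heap, then splices each accumulated chain onto the owner's
// lock-free free-stack with a single CAS, amortizing it over the batch.
struct Block { Block* next; };

struct RemoteStack {
    std::atomic<Block*> head{nullptr};

    // Push an entire chain [first..last] with one CAS.
    void push_batch(Block* first, Block* last) {
        Block* h = head.load(std::memory_order_relaxed);
        do {
            last->next = h;
        } while (!head.compare_exchange_weak(h, first,
                     std::memory_order_release, std::memory_order_relaxed));
    }
};

class ForeignCache {
    struct Chain { Block* first = nullptr; Block* last = nullptr; int count = 0; };
    std::unordered_map<RemoteStack*, Chain> pending_; // keyed by owning heap

public:
    static constexpr int kBatch = 32;

    // Called on every foreign free; touches only thread-local state until
    // kBatch blocks for the same owner have accumulated.
    void free_foreign(RemoteStack* owner, Block* b) {
        Chain& c = pending_[owner];
        b->next = c.first;
        c.first = b;
        if (!c.last) c.last = b;
        if (++c.count == kBatch) {
            owner->push_batch(c.first, c.last); // one CAS for kBatch blocks
            c = Chain{};
        }
    }
};
```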
Or we can create N^2 spsc-queues (everyone-to-everyone) and pass
foreign blocks through them. A thread does not have to multiplex them;
it can just take a free block from the first queue it comes across.
And where the former scheme has to execute a few CAS operations when a
batch is pushed or popped, here we have no CAS operations at all! None!
There would be no CAS in the entire code! And we get a multithreaded,
multicore-friendly allocator w/o CAS! I just can't believe it!
Yippee-Yay!
Dmitriy V'jukov
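One such spsc-queue can be sketched with only atomic loads and stores; this is a hypothetical bounded ring buffer, not any specific implementation from the thread:

```cpp
#include <atomic>
#include <cstddef>

// Bounded single-producer/single-consumer ring buffer: no CAS anywhere.
// In the N^2 "everyone-to-everyone" layout above, each ordered pair of
// threads would own one of these.
template <std::size_t N>
class SpscQueue {
    void* buf_[N];
    std::atomic<std::size_t> head_{0}; // consumer position
    std::atomic<std::size_t> tail_{0}; // producer position

public:
    // Producer side: plain store of the slot, release-store of the index.
    bool push(void* p) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false; // full
        buf_[t] = p;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: acquire-load of the producer index, plain load of the slot.
    bool pop(void*& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The trade-off Dmitriy alludes to: no CAS at all, but N^2 queues of fixed capacity, i.e. a larger footprint.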
> Or we can create N^2 spsc-queues (everyone-to-everyone) and pass
> foreign blocks through them. A thread does not have to multiplex them;
> it can just take a free block from the first queue it comes across.
> And where the former scheme has to execute a few CAS operations when a
> batch is pushed or popped, here we have no CAS operations at all! None!
> There would be no CAS in the entire code! And we get a multithreaded,
> multicore-friendly allocator w/o CAS! I just can't believe it!
> Yippee-Yay!
Yippee-Yay one more time! :)
It would be interesting to compare the performance of a state-of-the-art
inherently multithreaded allocator against a state-of-the-art inherently
single-threaded allocator used with such a scheme. And I think the
latter would blow the former out of the water :)
For example, Maged M. Michael's "Scalable Lock-Free Dynamic Memory
Allocation" needs at least 2 CAS per allocation and at least 2 CAS per
deallocation - a total of 4 CAS per memory block.
But the latter would probably have a larger footprint...
Dmitriy V'jukov
Indeed, I have used various per-thread caching algorithms... However, in
real-world use, it's good to keep in mind that a user-implemented
producer-consumer algorithm is likely to have its own built-in caching
mechanism...
Yup. The state-of-the-art allocator pseudo-code I posted completely blows
Michael's algorithm out of the water. I would recommend testing against
Hoard; it uses some per-thread techniques... However, the pseudo-code
still beats it.
2 CAS for malloc/free is just as expensive as acquiring a non-contended
mutex. So, what's the difference in performance between any
2-CAS-per-function lock-free algorithm and an
uncontended-mutex-per-function algorithm? Not much!
:^0
Okay, here is the deal everybody:
You can create a simple, 100% single-threaded allocator; no locks, no
atomic ops, nothing. Plain loads and stores will work just fine. Assume
that only one thread will ever be using your allocator at any one time. Fine.
After you complete the implementation of your single-threaded allocator,
you can use a special algorithm I invented which allows you to transform
that single-threaded allocator of yours into a 100% lock-free, distributed
memory allocator; no kidding. Is that neat or what?
-- This is turning out to be a VERY powerful tool indeed... Wow! --
:^0
I used the invention to transform the single-threaded Windows Heap API:
http://msdn2.microsoft.com/en-us/library/aa366711.aspx
into a scalable multi-threaded allocator. You pass the HEAP_NO_SERIALIZE
flag to the HeapCreate function:
http://msdn2.microsoft.com/en-us/library/aa366599.aspx
in order to make it single-threaded.
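The Heap API usage described here looks roughly like this (Windows-only sketch, not compiled or tested here; error handling omitted):

```cpp
#include <windows.h>

// Create a heap with internal locking disabled; this is safe only because
// the surrounding per-thread framework guarantees single-threaded access
// to each heap.
HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0 /*initial size*/, 0 /*growable*/);
void*  p    = HeapAlloc(heap, 0, 64); // heap was created unserialized, so
                                      // no per-call flag is needed
HeapFree(heap, 0, p);
HeapDestroy(heap);
```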
The funny thing about this is... well, the transformed Windows heap beats
the pants off the multi-threaded heap created without the
HEAP_NO_SERIALIZE flag set...
lol.
I transformed the version of the malloc/free functions exported by the
single-threaded Windows C runtime DLL. The transformed single-threaded
version outperforms the version exported by the multi-threaded Windows
C runtime DLL...
Neat!
I basically looked at the source code and recreated it using a C++ class
so that I could create one allocator instance per thread.
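The class-per-thread arrangement can be sketched like this; SingleThreadedHeap is a hypothetical stand-in for the recreated CRT allocator, here just forwarding to malloc/free:

```cpp
#include <cstdlib>

// Wrap a single-threaded allocator in a class and give each thread its own
// instance via thread_local; all per-heap state stays private to one thread.
class SingleThreadedHeap {
    long allocs_ = 0; // private to one thread, so plain (non-atomic) is fine
public:
    void* alloc(std::size_t n) { ++allocs_; return std::malloc(n); }
    void  release(void* p)     { std::free(p); }
    long  alloc_count() const  { return allocs_; }
};

thread_local SingleThreadedHeap tls_heap; // one private heap per thread

void* my_malloc(std::size_t n) { return tls_heap.alloc(n); }
void  my_free(void* p)         { tls_heap.release(p); }
```

This only covers same-thread alloc/free; freeing a block on a different thread from the one that allocated it is where the remote-freelist machinery discussed earlier in the thread comes in.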
> I transformed the version of the malloc/free functions exported by the
> single-threaded windows c-runtime dll. The transformed single-threaded
> version outperforms the version exported by the multi-threaded windows
> c-runtime dll...
Can you provide some rough numbers for some benchmark? Maybe a synthetic
one.
Otherwise, you know, "outperforms" is very... vague...
Dmitriy V'jukov
They will be on the web site. Which should be up in about 40 days.
> They will be on the web site. Which should be up in about 40 days.
Cool! It will be interesting to see...
Dmitriy V'jukov