RFC: Virtually Zero-Overhead Distributed Memory Allocator...

154 views
Skip to first unread message

Chris Thomasson

unread,
Sep 4, 2006, 5:45:56 PM9/4/06
to
Okay... No lengthy white papers yet...

;)


Here is the first pseudo-code implementation I created for my new allocator:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
(read all please)


Well, that's simple enough, right? Once you apply the fixes that I outlined,
the pseudo-code is 100% bug free.


However...The presented design is not the best way to implement my design...
Here are some quick notes on the implementation I finally decided to use in
my vZOOM library, which will be available soon:


http://groups.google.com/group/comp.programming.threads/msg/bc019dc04362be41

http://groups.google.com/group/comp.programming.threads/msg/1d3aeaa6ebf5181f

There you have it... This invention is my version of a highly scaleable
low-overhead lock-free slab-based memory allocator; very fast, IMHO... What
do you think of my design?


Any comments/critiques/suggestions/help are encouraged and will be
appreciated!


Thank you for you time...

BTW.... I included comp.benchmarks because I think my multi-threaded
allocator has the potential to get high marks across the board...


Chris Thomasson

unread,
Sep 4, 2006, 7:36:54 PM9/4/06
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:p7WdnVL00vKkAWHZ...@comcast.com...

> Okay... No lengthy white papers yet...
>
> ;)
>
>
> Here is the first pseudo-code implementation I created for my new
> allocator:
>
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
> (read all please)
>
>
> Well, that's simple enough, right? Once you apply the fixes that I
> outlined, the pseudo-code is 100% bug free.

I should clarify that the fixes are relevant to the atomic operations... I
think there is a misspelled word on the one of the local pointer accesses;
but it is not really relevant because the fix is so obvous...

If you have ANY trouble with the pseudo-code please let me know.

IMHO, it should be easily understood...


jacko

unread,
Sep 5, 2006, 10:57:57 AM9/5/06
to
hi

always prefered the circular linked list implement with list handle
pointing to the tail, so that tail can be got, and next is head.

would be quite good for parallel cpu this with seperate memory spaces?
umm, looking for something which has possible overlap of part of memory
spaces, as this eases communication in certain mem bus topologies.

i do wonder if a pipelined parallel bit ring is best local net for
wafer scale. i.e. serial shift in and out, but parallel shift arround
loop??

or just a giga bit shift ring on each cpu corner, of length 4 bits
total. and this would need some kind of variable address relocation
scheme for operation by more than 1 processor. a global export var
handle process running on 1 cpu would be needed, unless cpu had ip and
global variable was indexed by ip/port num pair

2KB blocks efficient ///

cheers.

Greg Lindahl

unread,
Sep 5, 2006, 12:09:34 PM9/5/06
to
In article <p7WdnVL00vKkAWHZ...@comcast.com>,
Chris Thomasson <cri...@comcast.net> wrote:

>There you have it... This invention is my version of a highly scaleable
>low-overhead lock-free slab-based memory allocator; very fast, IMHO... What
>do you think of my design?

You asked the same question a while ago, right? If so, how do you
respond to the comments which were made back then?

-- g

Sean Kelly

unread,
Sep 5, 2006, 7:37:00 PM9/5/06
to
Chris Thomasson wrote:
>
> Here is the first pseudo-code implementation I created for my new allocator:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
> (read all please)

It's a good design. In fact, it's what I've been saying they should
have done in Hoard for a while now :-). I think Hans Boehm said
something to the effect in his paper on a "mostly parallel" garbage
collector as well. That said, I'd be interested in seeing a
performance comparison with Doug Lea's allocator plugged into this vs.
some of the other high-performance multithreaded allocators.


Sean

Chris Thomasson

unread,
Sep 5, 2006, 8:31:59 PM9/5/06
to
"Sean Kelly" <se...@f4.ca> wrote in message
news:1157499420.0...@h48g2000cwc.googlegroups.com...

> Chris Thomasson wrote:
>>
>> Here is the first pseudo-code implementation I created for my new
>> allocator:
>>
>> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
>> (read all please)
>
> It's a good design.

Thank you.


> In fact, it's what I've been saying they should
> have done in Hoard for a while now :-).

I think Hoard has per-thread allocators, however, my design radically
differs from theirs. Is that what you were referring to?


> I think Hans Boehm said
> something to the effect in his paper on a "mostly parallel" garbage
> collector as well.

I think I read a paper on that... FWIW, my vZOOM library has the allocator
hooked up to my proxy garbage collector design:


http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/65c9c2673682d4cf/bc1506fb6d0e3ba7?lnk=gst&q=vzoom&rnum=1#bc1506fb6d0e3ba7


This integrates a solution for the reader/writers problem in lock-free
data-structures directly into the lock-free allocator... Therefore, the
allocator supports lock-free reader patterns directly:


https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068


This is a nice feature for an allocator to support?


> That said, I'd be interested in seeing a
> performance comparison with Doug Lea's allocator plugged into this vs.
> some of the other high-performance multithreaded allocators.

I was thinking about comparing to stuff like:


http://portal.acm.org/citation.cfm?id=996893.996848

http://portal.acm.org/citation.cfm?id=773039.512451&coll=GUIDE&dl=ACM&idx=J706&part=periodical&WantType=periodical&title=ACM%20SIGPLAN%20Notices&CFID=11111111&CFTOKEN=2222222

http://citeseer.ist.psu.edu/johnson91concurrent.html


What do you think?


IMHO, I think that my multithreaded allocator has the "potential" to perform
as good, or better, than most existing multithread designs...


After I get my act together and publish the x86-32 and SPARC-32/64 versions
of my vZOOM library I can provide a non-commercial academic style license
that will allow people to benchmark and experiment... Of course, I will also
provide commercial license(s) for those who are interested...

;)

Thank you for your comments...


Chris Thomasson

unread,
Sep 5, 2006, 8:32:45 PM9/5/06
to
"Greg Lindahl" <lin...@pbm.com> wrote in message
news:44fda13e$1...@news.meer.net...

I did not get any responses.


;^(.....

Could you please provide a link? Perhaps I missed something...


Thank you.


Chris Thomasson

unread,
Sep 5, 2006, 8:36:44 PM9/5/06
to
"jacko" <jacko...@gmail.com> wrote in message
news:1157468277....@m79g2000cwm.googlegroups.com...
> hi

[...]

> always prefered the circular linked list implement with list handle
> pointing to the tail, so that tail can be got, and next is head.

A circular linked list would be perfectly compatible with this design...
However, you should try to minimize the per-block storage you need to
maintain the list...


> would be quite good for parallel cpu this with seperate memory spaces?

Indeed.

> umm, looking for something which has possible overlap of part of memory
> spaces, as this eases communication in certain mem bus topologies.
>
> i do wonder if a pipelined parallel bit ring is best local net for
> wafer scale. i.e. serial shift in and out, but parallel shift arround
> loop??

FWIW, this allocator is NUMA friendly...


Sean Kelly

unread,
Sep 6, 2006, 2:13:15 AM9/6/06
to
Chris Thomasson wrote:
> "Sean Kelly" <se...@f4.ca> wrote in message
> news:1157499420.0...@h48g2000cwc.googlegroups.com...
> > Chris Thomasson wrote:
> >>
> >> Here is the first pseudo-code implementation I created for my new
> >> allocator:
> >>
> >> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
> >> (read all please)
>
> > In fact, it's what I've been saying they should
> > have done in Hoard for a while now :-).
>
> I think Hoard has per-thread allocators, however, my design radically
> differs from theirs. Is that what you were referring to?

I was referring to the lock-free LIFO queue for deletions. Hoard has
per-thread allocators, but all allocations and deletions are protected
by a mutex and deletions are processed by the deleting thread.
Appending 'foreign' deletions to a per-thread lock-free queue instead
seemed an obvious way to improve performance.

> > I think Hans Boehm said
> > something to the effect in his paper on a "mostly parallel" garbage
> > collector as well.
>
> I think I read a paper on that... FWIW, my vZOOM library has the allocator
> hooked up to my proxy garbage collector design:
> http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/65c9c2673682d4cf/bc1506fb6d0e3ba7?lnk=gst&q=vzoom&rnum=1#bc1506fb6d0e3ba7

I've been trying to find the time to do something like this in a
traditional garbage collector, but other projects have taken
precedence. Unfortunately, it's not really possible to implement smart
pointers in my current language of choice, so proxy GC isn't really
feasible.

> This integrates a solution for the reader/writers problem in lock-free
> data-structures directly into the lock-free allocator... Therefore, the
> allocator supports lock-free reader patterns directly:
> https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
>
> This is a nice feature for an allocator to support?

Assuming a proxy GC design is an option, yes.

> > That said, I'd be interested in seeing a
> > performance comparison with Doug Lea's allocator plugged into this vs.
> > some of the other high-performance multithreaded allocators.
>
> I was thinking about comparing to stuff like:
>
> http://portal.acm.org/citation.cfm?id=996893.996848
> http://portal.acm.org/citation.cfm?id=773039.512451&coll=GUIDE&dl=ACM&idx=J706&part=periodical&WantType=periodical&title=ACM%20SIGPLAN%20Notices&CFID=11111111&CFTOKEN=2222222
>
> http://citeseer.ist.psu.edu/johnson91concurrent.html
>
> What do you think?

I've only read the first paper, but they all look like good candidates.


Sean

Chris Thomasson

unread,
Sep 7, 2006, 8:40:26 PM9/7/06
to
"Sean Kelly" <se...@f4.ca> wrote in message
news:1157523195.4...@e3g2000cwe.googlegroups.com...

> Chris Thomasson wrote:
>> "Sean Kelly" <se...@f4.ca> wrote in message
>> news:1157499420.0...@h48g2000cwc.googlegroups.com...
>> > Chris Thomasson wrote:
>> >>
>> >> Here is the first pseudo-code implementation I created for my new
>> >> allocator:
>> >>
>> >> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69/e3efa5628aad4a82?lnk=gst&q=lock-free+distributed&rnum=1#e3efa5628aad4a82
>> >> (read all please)
>>
>> > In fact, it's what I've been saying they should
>> > have done in Hoard for a while now :-).
>>
>> I think Hoard has per-thread allocators, however, my design radically
>> differs from theirs. Is that what you were referring to?
>
> I was referring to the lock-free LIFO queue for deletions. Hoard has
> per-thread allocators, but all allocations and deletions are protected
> by a mutex and deletions are processed by the deleting thread.

Ah...


> Appending 'foreign' deletions to a per-thread lock-free queue instead
> seemed an obvious way to improve performance.

Yeah... Well, you need some form of atomic reference counting. My allocator
code has a special form of atomic reference counting (check pseudo-code)...
Have you seen this type of reference counting before?


I think I may have invented a specialized form that avoids memory barriers
and atomic operations for the lifetime of the entity that is protected by
the count.... The pseudo-code example shows how to keep a per-thread
allocator object alive if there are any outstanding references' to any
blocks in any of its slabs... I think I have a novel design...


Any thoughts?


>> > I think Hans Boehm said
>> > something to the effect in his paper on a "mostly parallel" garbage
>> > collector as well.
>>
>> I think I read a paper on that... FWIW, my vZOOM library has the
>> allocator
>> hooked up to my proxy garbage collector design:
>> http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/65c9c2673682d4cf/bc1506fb6d0e3ba7?lnk=gst&q=vzoom&rnum=1#bc1506fb6d0e3ba7
>
> I've been trying to find the time to do something like this in a
> traditional garbage collector, but other projects have taken
> precedence. Unfortunately, it's not really possible to implement smart
> pointers in my current language of choice, so proxy GC isn't really
> feasible.

You can design proxy gc by tracking per-thread/cpu/whatever synchronization
epochs. You don't need smart pointers to implement proxy gc. You can do a
prototype in Java by using volatile... However, the Java memory model is so
strict that it would force your reader threads to use load-acquire barriers
(#LoadStore | #LoadLoad) during their tedious traversals throughout your
applications shared data-structures. Not worth the trouble in Java.


>> This integrates a solution for the reader/writers problem in lock-free
>> data-structures directly into the lock-free allocator... Therefore, the
>> allocator supports lock-free reader patterns directly:
>> https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
>>
>> This is a nice feature for an allocator to support?
>
> Assuming a proxy GC design is an option, yes.

I thought so! I need some more feedback... This thread is not receiving the
traffic I thought it might encounter...

;)


>> > That said, I'd be interested in seeing a
>> > performance comparison with Doug Lea's allocator plugged into this vs.
>> > some of the other high-performance multithreaded allocators.
>>
>> I was thinking about comparing to stuff like:
>>
>> http://portal.acm.org/citation.cfm?id=996893.996848
>> http://portal.acm.org/citation.cfm?id=773039.512451&coll=GUIDE&dl=ACM&idx=J706&part=periodical&WantType=periodical&title=ACM%20SIGPLAN%20Notices&CFID=11111111&CFTOKEN=2222222
>>
>> http://citeseer.ist.psu.edu/johnson91concurrent.html
>>
>> What do you think?
>
> I've only read the first paper, but they all look like good candidates.

Yup... Also... I need to bench against leapheap...


Reply all
Reply to author
Forward
0 new messages