Memory allocation


Barry

Apr 5, 2010, 10:13:00 PM
to Chicago Erlang User Group
Did anyone see this? http://news.ncsu.edu/releases/wmssolihinthreads/

It appears they've discovered a memory allocation scheme that Erlang is
already at least partially using, or will be using in R14/R15.

At Erlang Factory 2010 San Francisco, Patrik Nyblom gave a talk
covering many of the optimizations that have gone into the Erlang VM,
including some related to memory. One of the optimizations he
described sounds remarkably close to the one in the PR blurb from
NCSU.

I'd love to get my hands on that paper.

Anyone attending IEEE International Parallel and Distributed
Processing Symposium in Atlanta?

Barry

Martin Logan

Apr 6, 2010, 11:11:31 AM
to ce...@googlegroups.com
Interesting article. I wonder if the performance gains from this are
predictable and consistent. I can't see the communication between the
worker and the memory-manager thread being asynchronous most of the
time, and I imagine there are cases where the communication overhead,
coupled with a slow memory-management operation, would cause more
delay, not less.

Cheers,
Martin


Barry Nicholson

Apr 7, 2010, 4:21:57 PM
to ce...@googlegroups.com
That's exactly why I think the Erlang solution (Actors) is so much
better. I'm just curious whether similar techniques could be used, or
are already being used, inside the Erlang VM. Someday I need to go
read the Erlang VM source code.

Barry

Tristan Sloughter

Apr 7, 2010, 4:30:32 PM
to ce...@googlegroups.com
Yeah, I have no idea how this increases performance on a memory allocation... the caller still has to wait until that space is allocated.

Tristan Sloughter

Apr 7, 2010, 4:31:33 PM
to ce...@googlegroups.com
What I meant is that it only acts like it does :). “the computational thread notifies the memory-management thread – effectively telling it to allocate data storage and to notify the computational thread of where the storage space is located”.

Christopher Atkins

Apr 7, 2010, 6:03:43 PM
to ce...@googlegroups.com
The paper is here:

As I read it, there is a classic time-space trade-off, with some very interesting work to make the trade-off worthwhile. A naive implementation was 2.7x slower, while their "final" version was 1.19x faster. Importantly, they can apply it without modifying either the original application source code or the memory management library the application employs.

Basically, the approach is to create a client-server scheme where the client logic resides in the application thread and the server logic in a so-called Memory Management Thread (MMT). Calls to the memory management library in the application thread are effectively intercepted and sent to the MMT via request/reply queues.
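
To make the round trip concrete, here is a deliberately naive sketch of that client-server split, assuming a single application thread. The names (client_malloc, mmt_loop, alloc_request) and the one-slot queue are mine, not the paper's, and this version still leans on pthread mutexes and condition variables, i.e. roughly the kind of naive implementation they measured at 2.7x slower:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        size_t size;    /* requested allocation size  */
        void  *result;  /* filled in by the MMT       */
        int    done;    /* reply flag, set by the MMT */
    } alloc_request;

    static alloc_request *pending;  /* one-slot "queue"; single client assumed */
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

    /* Client side: what an intercepted malloc() call does. */
    void *client_malloc(size_t size)
    {
        alloc_request req = { size, NULL, 0 };
        pthread_mutex_lock(&q_lock);
        pending = &req;
        pthread_cond_signal(&q_cond);           /* wake the MMT       */
        while (!req.done)                       /* wait for the reply */
            pthread_cond_wait(&q_cond, &q_lock);
        pthread_mutex_unlock(&q_lock);
        return req.result;
    }

    /* Server side: the Memory Management Thread's loop. */
    void *mmt_loop(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&q_lock);
        for (;;) {
            while (pending == NULL)
                pthread_cond_wait(&q_cond, &q_lock);
            pending->result = malloc(pending->size);  /* the real allocator */
            pending->done   = 1;
            pending = NULL;
            pthread_cond_broadcast(&q_cond);    /* wake the waiting client */
        }
    }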

The two main problems with the approach are communication latencies and synchronization latencies.  They have a number of techniques to mitigate these.

Deallocations can be performed asynchronously, so the client logic just keeps a bucket of these queued (up to 200, an experimentally determined maximum).

To reduce allocation-request latencies, they perform aggressive, speculative pre-allocation of a bucket of chunks every time a new object size is requested, provided the size is <512K (another experimentally determined cutoff, which also limits fragmentation).
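
Both mitigations are easy to picture. Here's a sketch: the 200-entry free batch and 512K cutoff come from the paper, but everything else (the names, the one-size-class cache, the PREALLOC_COUNT of 16) is invented for illustration, and the MMT round trips are stubbed with plain malloc/free so the fragment stands alone:

    #include <stddef.h>
    #include <stdlib.h>

    #define FREE_BATCH     200          /* flush frees in batches of 200  */
    #define PREALLOC_LIMIT (512*1024)   /* only pre-allocate sizes < 512K */
    #define PREALLOC_COUNT 16           /* chunks fetched per size miss   */

    /* Stub for the asynchronous free batch; in the real scheme this
     * hands the whole bucket to the MMT and returns without waiting. */
    static void mmt_send_free_batch(void **ps, int n)
    {
        for (int i = 0; i < n; i++)
            free(ps[i]);
    }

    static void *free_bucket[FREE_BATCH];
    static int   free_count;

    /* Frees are fire-and-forget: batch locally, ship when full. */
    void client_free(void *p)
    {
        free_bucket[free_count++] = p;
        if (free_count == FREE_BATCH) {
            mmt_send_free_batch(free_bucket, free_count);
            free_count = 0;
        }
    }

    /* A one-size-class cache, enough to show the allocation fast path. */
    static void  *chunk_cache[PREALLOC_COUNT];
    static int    cached;
    static size_t cached_size;

    void *client_malloc(size_t size)
    {
        if (size == cached_size && cached > 0)
            return chunk_cache[--cached];        /* hit: no round trip     */
        if (size < PREALLOC_LIMIT) {             /* miss: refill the cache */
            for (cached = 0; cached < PREALLOC_COUNT; cached++)
                chunk_cache[cached] = malloc(size);  /* stands in for MMT  */
            cached_size = size;
            return chunk_cache[--cached];
        }
        return malloc(size);                     /* large sizes: no caching */
    }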

They trade out pthread synchronization primitives for "lock-free" approaches (a rough sketch follows the list):
a) the client semaphore ("wake up, you have work to do") and the server semaphore ("done!") are maintained independently by the two threads. So the MMT can read the client's semaphore but not write to it; it compares its own count with the client's to see if there is work to do.
b) the request/reply queues are alternately read-only/read-write to the client and server respectively. At the end of a pre-defined time period (an epoch), these roles switch. So, they don't need locks on those queues. <-- this is the awesome part
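
Here is roughly how I picture both tricks, written with C11 atomics (my choice of notation; the paper predates C11, and the names and the epoch handshake below are simplified guesses, not the authors' code):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* (a) Each thread owns one counter and only ever writes its own;
     * the other thread merely reads it. Pending work shows up as the
     * two counts differing. */
    static _Atomic unsigned client_posted;  /* written by the client only */
    static _Atomic unsigned server_done;    /* written by the MMT only    */

    void client_post_request(void) { atomic_fetch_add(&client_posted, 1); }
    void mmt_mark_done(void)       { atomic_fetch_add(&server_done, 1);   }

    /* MMT's read-only check: is there work to do? */
    bool mmt_work_pending(void)
    {
        return atomic_load(&client_posted) != atomic_load(&server_done);
    }

    /* (b) Two buffers whose reader/writer roles swap every epoch, so
     * neither thread ever needs a lock on the buffer it is touching.
     * How the two threads agree on the epoch boundary is glossed over. */
    #define QLEN 64
    typedef struct { void *slots[QLEN]; int n; } queue;

    static queue qa, qb;
    static _Atomic int epoch;

    queue *client_write_queue(void) { return (atomic_load(&epoch) & 1) ? &qb : &qa; }
    queue *mmt_read_queue(void)     { return (atomic_load(&epoch) & 1) ? &qa : &qb; }

    void advance_epoch(void) { atomic_fetch_add(&epoch, 1); }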

Hopefully this is a pretty accurate summary.  It feels like injecting VM-style memory management with AOP, metaphorically speaking.

John H Patton

Apr 7, 2010, 6:50:00 PM
to Chicago Erlang User Group
Awesome! Thanks for posting the paper and the overview. I find it
interesting that it doesn't need to be tuned for specific
applications, since there's little performance gain for bucket sizes
outside 200-400. Nearly 20% improvement just from adding the MMT,
with no modification to the application, is huge.

When is R14/15 coming out? I'm looking on github and erlang.org...
there isn't a release calendar posted anywhere. This looks like it
might be fun to mess around with.

Cheers,

-john

Barry

Apr 8, 2010, 4:38:20 PM
to Chicago Erlang User Group
Kenneth Lundin of the Ericsson Erlang/OTP team gave the following
dates for R14 in his talk at the San Francisco Erlang Factory in
March.

June 16, 2010: R14A, a beta release
Sept 01, 2010: R14B first drop for commercial use

To me it looks like Erlang in essence already does what this paper
describes, or will in R14, but in a way that is lock-free and doesn't
require this kind of synchronization. Once again, Erlang is ahead of
the trend.

Ironically, I wonder whether this new method would help high-quality
C/C++ code anyway. Many of the programs I've worked with in the past
already do speculative preallocation and deallocation of
structures/classes. It makes constructing a new object considerably
faster, and deallocation much faster too. Granted, these aren't
desktop programs.
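
In case it helps, the pattern I mean is basically a fixed-size
free-list pool. A minimal sketch (names and the fall-back-to-malloc
policy are just illustrative; error handling omitted):

    #include <stdlib.h>

    typedef struct node { struct node *next; } node;

    typedef struct {
        node  *free_list;   /* stack of ready-to-use objects */
        size_t obj_size;    /* fixed object size             */
    } pool;

    void pool_init(pool *p, size_t obj_size, int count)
    {
        if (obj_size < sizeof(node))
            obj_size = sizeof(node);    /* objects must fit a link pointer */
        p->obj_size  = obj_size;
        p->free_list = NULL;
        for (int i = 0; i < count; i++) {   /* the speculative preallocation */
            node *n = malloc(obj_size);
            n->next = p->free_list;
            p->free_list = n;
        }
    }

    void *pool_alloc(pool *p)
    {
        if (p->free_list == NULL)
            return malloc(p->obj_size); /* pool exhausted: fall back */
        node *n = p->free_list;
        p->free_list = n->next;
        return n;
    }

    void pool_free(pool *p, void *obj)  /* "deallocation" is just a push */
    {
        node *n = obj;
        n->next = p->free_list;
        p->free_list = n;
    }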

Barry Nicholson

Tristan Sloughter

Apr 8, 2010, 4:45:06 PM
to ce...@googlegroups.com
Good point. Wouldn't that help speed even without being multi-threaded? I'd like to see a comparison of the two methods (assuming I'm correct...) and what the speedup is.