JVM memory management slower than C?

9,281 views

baboune

Sep 24, 2013, 4:37:30 PM
to mechanica...@googlegroups.com
Hi,

Maybe this is a stupid question, so be patient. I think this is related to mechanical sympathy.

I was at a conference today (https://www.sics.se/events/cloud-and-big-data-day-2013-program), and Matei Zaharia (Berkeley) said that they were considering rewriting Spark in C/C++ to see if it would go faster. He then blamed the performance cost on Java memory management.

Once you go off-heap for the big-data-management parts (which he indicated Spark is already doing), are the GC costs and object costs so large as to cause such a difference in performance? From a memory perspective, does the JVM impose limits on what can be done to manage memory?

Thanks

Kevin Burton

Sep 24, 2013, 4:40:53 PM
to mechanica...@googlegroups.com
I've found that ByteBuffer is far from fun to work with. There are also some severe bugs in it which make it difficult to use.

I took the approach of dumping it and bypassing the whole thing by using Unsafe.
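
For anyone who hasn't gone down this road, here is a minimal sketch of what off-heap access via sun.misc.Unsafe looks like. It is illustrative only (not the actual code being discussed), and it assumes a HotSpot JVM, where the Unsafe singleton hides in the private "theUnsafe" field:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapSketch {
    private static final Unsafe UNSAFE;
    static {
        try {
            // HotSpot-specific: the singleton hides in a private field.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        long address = UNSAFE.allocateMemory(1024); // raw bytes, invisible to the GC
        try {
            UNSAFE.putLong(address, 42L);           // write at the raw address
            System.out.println(UNSAFE.getLong(address));
        } finally {
            UNSAFE.freeMemory(address);             // no GC safety net: you must free
        }
    }
}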

That was NOT fun and cost me about 1-1.5 months of refactoring my code base.  

I really dislike hidden landmines like this ... at least in C the problems are somewhat understood ahead of time.

But the fact that Java is limited like this isn't fun.

Matei Zaharia

Sep 24, 2013, 4:54:41 PM
to mechanica...@googlegroups.com
Matei here -- for what it's worth in case people are reading this, I never said we're considering rewriting it in C++. I think that would be crazy! What I said is that we've worked around issues with naive use of Java objects by doing our own memory layout. Maybe I also said that for Spark Streaming, if you wanted to make the latency 10x lower, you'd have to avoid minor GCs, but that is more of a research project than a plan to rewrite anything.

Matei

Peter Lawrey

Sep 24, 2013, 5:16:21 PM
to mechanica...@googlegroups.com

I have not come across any suggestion that memory allocation in C++ is any faster than in Java. What makes a difference is GC pauses. IMHO, if you can live without managed memory, you can do this in Java as well and eliminate GC pauses.


Todd Lipcon

Sep 24, 2013, 5:18:43 PM
to mechanica...@googlegroups.com
On Tue, Sep 24, 2013 at 2:16 PM, Peter Lawrey <peter....@gmail.com> wrote:

I have not come across any suggestion that memory allocation in C++ is any faster than in Java. What makes a difference is GC pauses. IMHO, if you can live without managed memory, you can do this in Java as well and eliminate GC pauses.


I think naive memory management in C++ (e.g. malloc/free or new/delete) is really quite comparable to allocation in Java.

On the other hand, C++ allows you to do more clever things like allocating objects out of arenas and "bulk freeing" those objects if they are POD types without destructors. These kinds of things are impossible in Java, which can hurt performance (and potentially has ramifications for memory fragmentation as well).

-Todd

Matt Fowles

Sep 24, 2013, 5:22:09 PM
to mechanica...@googlegroups.com
Peter~

In my experience the biggest difference is stack allocation. In C++, temp objects are usually on the stack. Hence their allocation is a thread-local pointer bump (like Java), but their collection is trivial and requires no additional work (unlike Java). The fact that nested structures don't lead to pointer indirections helps too.

Honestly, with the convenience features of C++11, C++ is becoming a downright pleasant language to develop in.

Matt

Peter Lawrey

Sep 24, 2013, 5:27:52 PM
to mechanica...@googlegroups.com

Native C allocation is single-threaded. Java object allocation is multi-threaded.

I agree that stack allocation doesn't work well in Java even with escape analysis. However, if this is important to you, there are workarounds. ;)

Todd Lipcon

Sep 24, 2013, 5:35:06 PM
to mechanica...@googlegroups.com
On Tue, Sep 24, 2013 at 2:27 PM, Peter Lawrey <peter....@gmail.com> wrote:

Native C allocation is single-threaded. Java object allocation is multi-threaded.


That's not true -- if you use any half-decent allocator, you get thread-local allocations and good concurrent performance in C++. For example, tcmalloc performs well under concurrency.
 

I agree that stack allocation doesn't work well in Java even with escape analysis. However, if this is important to you, there are workarounds. ;)


Sure - I don't think anyone debates that you can work around a lot of this stuff with Unsafe and big buffers. But then you lose a lot of the advantages of an object-oriented language.

Peter Lawrey

Sep 24, 2013, 5:36:07 PM
to mechanica...@googlegroups.com

In my experience, how you use the language and structure your program matters more than the choice of language. Where there is a difference is that C++ attracts developers interested in the low-level details, and even in Java many of the best low-level developers have worked in C++.

In short, to get the best results, change the developers rather than, or as well as, the language.

Howard Chu

Sep 25, 2013, 3:47:33 AM
to mechanica...@googlegroups.com
The language influences the developers. (Just as human language influences human thought, and richness of language enables richness of expression. And the opposite as well, e.g. George Orwell's Newspeak, designed to constrict human thought by shrinking vocabulary.) You most likely have to change both.

If you stick to the pure Java language features, you're extremely restricted. And the usual also applies - it is possible for a GC to achieve the performance of native/explicit memory allocation, but only at the cost of 2x memory consumption. In a memory-constrained environment, like a smartphone, you pay visibly for the choices you make. Or, more to the point, your *users* pay.



baboune

Sep 25, 2013, 11:44:51 AM
to mechanica...@googlegroups.com
Hi Matei,
I had no idea that you were in this group. Anyway, I might have misheard you during the question part. What I understood was that you were considering a C rewrite to see whether there would be an improvement, and by how much.

How is it that few developers know about DirectBuffer and off-heap techniques for Java (like Unsafe)?

Kirk Pepperdine

Sep 27, 2013, 4:02:26 PM
to mechanica...@googlegroups.com
As most here know, allocations in Java are pretty much limited to a bump on an uncontended pointer. This, until recently, wasn't the case in C++. GC pause time is an artifact of recovery, not allocation. There are so many sources of random latency in a cloud that I question whether the problem really is memory management, without stronger evidence.

-- Kirk

Martin Thompson

Oct 1, 2013, 4:17:25 PM
to mechanica...@googlegroups.com, to...@lipcon.org

On Tuesday, 24 September 2013 22:35:06 UTC+1, Todd Lipcon wrote:

Native C allocation is single-threaded. Java object allocation is multi-threaded.

That's not true -- if you use any half-decent allocator, you get thread-local allocations and good concurrent performance in C++. For example, tcmalloc performs well under concurrency.
 
As soon as you start allocating memory on one thread and freeing it on another in C/C++, things get ugly. Local allocation has got much better, but freeing still has a way to go. Some memory managers need to take a lock to free memory across CPUs.

Martin...

Gil Tene

Oct 2, 2013, 10:32:58 AM
to mechanica...@googlegroups.com
Beyond the cross-thread freeing questions, freeing heap-allocated memory in C/C++ is fundamentally more expensive than it is in Java, C#, and other GC'ed environments. This observation is true for all memory allocations that are not arena-based, and while arena-based allocation can approximate the efficiency and low management cost of GC-based heap systems, it can only be applied (in C/C++) in very specific contexts, and does not work for regular objects that can be passed around to libraries, held in collections, etc.

The higher fundamental cost comes from the fact that there is work involved on the free()/destroy side of things in C/C++, where no such work exists in a GC'ed environment. If reference counting is involved, the cost is much higher even when thread-local, as each pointer overwrite involves ref-counting work. Even when no ref counting is used, the free operation will at least double the per-allocation cost compared to a good evacuating collector that leverages the weak generational hypothesis.

In addition, in GC-based heaps, even the much lower and amortized cost of GC'ing and efficiently recovering huge contiguous chunks of space can be moved to background threads, such that the actual executing program threads that perform the allocation do not have to pay it directly, making them faster. When plentiful cores exist, this translates into fundamentally better latency for memory management, not just better throughput.

Rüdiger Möller

Oct 2, 2013, 11:30:12 AM
to mechanica...@googlegroups.com
For a fair real-world comparison, it should be taken into account that a good part of C++ allocations are on the stack. The duration of NewGen GC should be added to allocation cost in Java, as every allocation fills eden and increases the frequency of NewGen collections.

Kirk Pepperdine

Oct 2, 2013, 11:33:43 AM
to mechanica...@googlegroups.com
Fair enough, but an allocation on the heap with an evacuating collector has the same effect as an allocation on the stack in C++. And both are free to collect, in the sense that the data gets thrown away as the stack unwinds or as live data in a space is evacuated.

Regards,
Kirk

Martin Thompson

Oct 2, 2013, 12:13:08 PM
to mechanica...@googlegroups.com

On Wednesday, 2 October 2013 16:30:12 UTC+1, Rüdiger Möller wrote:
For a fair real-world comparison, it should be taken into account that a good part of C++ allocations are on the stack. The duration of NewGen GC should be added to allocation cost in Java, as every allocation fills eden and increases the frequency of NewGen collections.

This is fair. In C/C++ we also need to include the cross-thread free/delete costs.

For Java to get better we need more work on escape analysis, so objects get stack-allocated where possible to reduce minor GC activity, and minor GC needs to happen concurrently, without a stop-the-world event, so we can scale up with increasing core counts.

Martin...

Kirk Pepperdine

Oct 2, 2013, 2:05:37 PM
to mechanica...@googlegroups.com
Very good point. The allocation of a dead object in young gen results in an almost-free collection, but it does increase collection frequency, whereas an on-stack allocation may not need to be collected at all. However, it will still need to be scanned for roots when the object becomes invisible before it goes out of scope.

-- Kirk

Gil Tene

Oct 2, 2013, 5:03:29 PM
to mechanica...@googlegroups.com


On Wednesday, October 2, 2013 8:30:12 AM UTC-7, Rüdiger Möller wrote:
For a fair real-world comparison, it should be taken into account that a good part of C++ allocations are on the stack.

Fair point, but this is why I focus on heap allocation cost. Stacks are small and heaps can be big. Not many ways to use 1GB with stack allocation.
 
The duration of NewGen GC should be added to allocation cost in Java, as every allocation fills eden and increases the frequency of NewGen collections.

Only the pause times associated with NewGen collections should be added. In current JVMs, where that pause time totals 500usec spread over many seconds, Java wins big time.

Gil Tene

Oct 2, 2013, 5:20:56 PM
to mechanica...@googlegroups.com
There are a few differences here:

1. Stack allocation is a form of arena allocation, and suffers from most of the same limitations. In addition, it is by necessity only useful for short-lived, thread-local things.

2. Size matters. Registers let you hold low tens of words. Stacks (in a practical sense) let you hold KBytes and maybe even MBytes, but for anything that even scratches the surface of using 1% of a commodity server's memory capacity to track some sort of state or contents, heap allocation is the only thing that actually exists.

3. While stack allocation patterns do reduce heap allocation, overlapping a bit with the benefits newgen collectors get from discarding dead stuff with no work at all, they generally produce no more than a 3:1 or 4:1 filter against heap allocation. In contrast, "most things die young" tends to [empirically] produce a 20x-100x kind of thing, regardless of heap or stack allocation patterns. Simply put: even in C, most mallocs die young.


Rajiv Kurian

Oct 2, 2013, 5:32:30 PM
to mechanica...@googlegroups.com
Why would an arena allocator not work when used with C++ libraries or STL? What if you use LD_PRELOAD to preload, say, jemalloc?

Gil Tene

Oct 2, 2013, 5:44:20 PM
to mechanica...@googlegroups.com
The practical use cases for which arena allocators work are semantically limited. They come with the implicit presumption that "when the time comes" the contents of the arena can be thrown away without looking into it or iterating on it in any way. This means that nothing outside the arena refers to anything within it at that point and that nothing in the arena refers to anything outside of it in a way that needs to be accounted for when the reference goes away. Break either of these assumptions, and you lose the efficiency of arena allocation (or end up with a corrupted heap - your choice).

Some real-world problems fit into this model, but most don't.

Rajiv Kurian

Oct 2, 2013, 5:58:51 PM
to mechanica...@googlegroups.com
Just curious, can you provide an example of a common application where stack-based allocation for small short-lived objects plus arena allocation for large but short-lived objects is not good enough?

Gil Tene

Oct 2, 2013, 6:21:01 PM
to <mechanical-sympathy@googlegroups.com>
Well, pretty much anything that has long-lived objects, which covers the vast majority of useful applications I can think of.

E.g. A queue. A cache. A catalog. An ESB. An in-memory index. A price book...


Peter Lawrey

Oct 2, 2013, 6:25:09 PM
to mechanica...@googlegroups.com
Actually, arenas work with very long-lived objects; it's when you have a mix of very long-lived and not-so-long-lived objects, and you can't waste the resources, that you have an issue.



Gil Tene

Oct 2, 2013, 6:34:40 PM
to <mechanical-sympathy@googlegroups.com>
Yup. Trouble like the kind you would get with all the app examples I listed earlier.


Rajiv Kurian

Oct 2, 2013, 6:53:27 PM
to mechanica...@googlegroups.com
Why can't a queue be built with arena allocation? Are the cross-thread allocations and frees the issue here? If they aren't, why can't a producer thread get a chunk from an arena and put it on the queue? The consumer thread processes the entry and returns the memory to the arena. If the queues are fixed-size, then we use a circular buffer to store the entries, with arena allocation for dynamically sized buffers. This could be used to build a simple TCP server that accepts connections and reads data on one thread and delegates processing to worker threads.



Peter Lawrey

Oct 2, 2013, 6:59:09 PM
to mechanica...@googlegroups.com

This works well but is very limited. Say your application needs some state which lives between messages in this ring buffer. It is hard to have a non-trivial program which is truly stateless.


Gil Tene

Oct 2, 2013, 7:00:32 PM
to <mechanical-sympathy@googlegroups.com>
The stuff that is put in the queue can have (and typically does have) variable, independent lifetimes, and can die (go out of scope) out of order. Think of a work queue as a simple example.


Rajiv Kurian

Oct 2, 2013, 8:38:38 PM
to mechanica...@googlegroups.com
I thought that jemalloc was a general-purpose allocator that could just replace malloc.

Gil Tene

Oct 2, 2013, 9:02:55 PM
to <mechanical-sympathy@googlegroups.com>
It is. But it's not an arena allocator. It's just a malloc/free implementation.

Arena allocation is fundamentally incompatible with malloc/free from a semantics point of view. The fact that jemalloc uses the term "arena" to describe its internal chunk units probably leads to some confusion around its classification...

An arena allocator is an allocator that allows efficient freeing of allocated objects "all at once". The arena accumulates objects through allocations, and then the arena as a whole (not the individual objects) is freed when it is done with and no longer needed. Arena allocators derive their efficiency from the elimination of per-object free operations, so anything that frees individual objects or iterates over them during the free operation is not an arena allocator (at least not from an efficiency-benefits point of view).
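
To make the "all at once" semantics concrete, here is a toy bump-pointer arena sketched in Java terms over a direct ByteBuffer; the class and method names are made up for illustration:

import java.nio.ByteBuffer;

// Toy bump-pointer arena: allocation is a bounds check plus a position
// bump, and the only "free" is resetting the whole arena at once.
public final class Arena {
    private final ByteBuffer chunk;

    public Arena(int capacityBytes) {
        chunk = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Hand out a slice of the arena; throws if the arena is exhausted.
    public ByteBuffer allocate(int size) {
        ByteBuffer slice = chunk.slice();
        slice.limit(size);
        chunk.position(chunk.position() + size);
        return slice;
    }

    // Frees everything allocated so far without touching any object.
    // Anything still holding a slice now points at recycled memory,
    // which is exactly the semantic restriction described above.
    public void reset() {
        chunk.clear();
    }
}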


Rajiv Kurian

Oct 3, 2013, 12:02:14 AM
to mechanica...@googlegroups.com
Aah I see - thanks for explaining.

Kirk Pepperdine

Oct 3, 2013, 2:03:48 AM
to mechanica...@googlegroups.com
The problem is when they get mixed. You may be able to use application semantics to sort long-lived from short-lived, and maybe some of this can be captured with escape analysis, but as soon as you can't capture it all, you end up with mixing... and it all falls apart.

Howard Chu

Oct 3, 2013, 7:58:50 AM
to mechanica...@googlegroups.com


On Wednesday, October 2, 2013 7:32:58 AM UTC-7, Gil Tene wrote:
Beyond the cross-thread freeing questions, freeing heap-allocated memory in C/C++ is fundamentally more expensive than it is in Java, C#, and other GC'ed environments. This observation is true for all memory allocations that are not arena-based, and while arena-based allocation can approximate the efficiency and low management cost of GC-based heap systems, it can only be applied (in C/C++) in very specific contexts, and does not work for regular objects that can be passed around to libraries, held in collections, etc.

Re: passing objects to libraries - this *should* not be an issue. In particular, libraries cannot safely free memory that was given to them by an outside caller - the library has no way to know how the memory was allocated. Anyone who's spent any amount of time working in Windows should already be familiar with this - every DLL is linked independently to the C runtime, so memory allocated by malloc() in one DLL cannot be freed by free() in any other module, because they're all using independent heaps. It must all be freed by the same module that allocated it. Or alternatively, every library provides hooks for the app to install its own malloc/free handlers, and only one module's malloc/free implementation ever gets used.

So - either you coordinate malloc/free implementations, or libraries are forbidden from freeing memory that is passed to them. And generally, apps are forbidden from freeing memory returned from a library, for the same reasons; the library must provide its own destructor entry point for such things. Given these rules, which are non-negotiable, it's perfectly fine to use an arena-based allocator in C/C++ apps.

The issue of differing object lifetimes is annoying. My approach has been to use two sets of allocators, one for short-lived objects and one for longer-lived ones. "Short" vs "long" is simple - anything that doesn't need to live beyond the current function is short. It could usually be satisfied using alloca() (stack allocation), but we tend to avoid alloca() because it's non-portable. Anything that needs to be returned to a caller or otherwise outlive the function is "long" lifetime. Anything that needs to live beyond the current thread just uses regular malloc().

Gil Tene

Oct 3, 2013, 10:02:43 AM
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
Allocator-frees is not the only contract in wide use in the C/C++/Obj-C variant world...

There are plenty of examples of other memory contracts across libraries. Most of them use ref-counting forms, e.g. retain/release variants (as in Objective-C) and smart pointers (in which common code can use ref counting or not, depending on smart-pointer type and scope or context). These are needed and used for containers and collections of things that come from a wide range of library code. Boost, C++11, and smart pointers combined with STL are all pretty common uses of this stuff.

Memory allocation/freeing contracts are a bitch to manage in a non-GC world, but that doesn't make the code slower or faster. It just makes it harder, uglier, buggier, less reusable, or maybe just requires-more-IQ-reputation-points-to-write-well. The consistent use of retain/release and even ARC in the rich sort of modern Objective-C libraries is a great example of how far library code can go without having to take on an allocator-frees contract, or some strange receiver-frees contract variants. But it's still funny to watch how people try to ignore the possibility or need for the simplest of accidental circular referencing, and consider it "the programmer's fault" if your naturally represented data structures can end up creating things like graphs-of-hash-tables-of-linked-lists-that-may-hold-pointers-to-graphs.

Sent from my iPad

Alexander Turner

Oct 5, 2013, 11:09:38 AM
to mechanica...@googlegroups.com
Is it fair to point out that the very fastest approach is not to allocate at all? This is one reason that F77 and older COBOL programs run so very fast. Further, talking about any form of heap allocation in low-latency code is poor practice; low-latency code should not have any heap allocation unless that allocation is done outside of the low-latency code segments.

Discussions of the performance difference between Java and C for memory management never make much sense to me: either memory speed is critical, in which case one should not use a heap at all, or it is not, in which case it does not matter.

Gil Tene

Oct 5, 2013, 11:28:42 AM
to <mechanical-sympathy@googlegroups.com>
Let's not confuse latency with low latency here.

Allocation (and freeing) latency matters a lot to human-response-time apps, and even to batch apps, because allocation latency naturally accumulates in many-msec operations that do non-trivial work, making differences in memory-management speed amount to observable differences in throughput, response time, and CPU use.

When it comes to low-latency (e.g. sub-50usec) apps, I would generally agree that allocation on the latency-critical path in a hyper-low-latency app should be pushed off to other threads as much as possible. But whether or not rapid allocation exists in your low-latency process can depend on the problem at hand, and on how hard or smart you want your low-latency code to be. For example, when your low-latency code path is able to consult many GB of in-process useful data sets that rapidly change in response to real-world behavior, it can often beat the pants off, and take the lunch money from, other low-latency code that might be trying to react by jerking its knee based on yesterday's knowledge.

Doing this sort of thing on the low-latency path can often be achieved with a few efficient lookups and a few cache misses even when leveraging massive data, leading to smart decisions made at microsecond or sub-microsecond speeds. But you are correct that most of the allocation and mutation work in such cases can and should be done in background threads in such a system.



Rajiv Kurian

Oct 7, 2013, 12:10:44 AM
to mechanica...@googlegroups.com, to...@lipcon.org
A bit off-topic, but are there any best practices when it comes to allocating on one thread and freeing on another? Any custom allocators (C/C++), or just patterns applicable to other languages too?

Martin Thompson

Oct 7, 2013, 2:52:18 AM
to mechanica...@googlegroups.com, to...@lipcon.org
When you have a clear exchange from one thread to another, then a ring buffer with sequence counters to denote usable positions can work very well. Memory never gets freed; it is just recycled. This can also work very well with the hardware prefetchers.
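
A bare-bones single-producer/single-consumer sketch of that idea follows; the names and the power-of-two capacity are illustrative, and a production version (such as a Disruptor-style ring buffer) would also pad the counters to avoid false sharing:

import java.util.concurrent.atomic.AtomicLong;

// Slots are allocated once and recycled forever; the two sequence
// counters alone decide which positions are usable by whom.
public final class SpscRing<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong produced = new AtomicLong(); // written by producer only
    private final AtomicLong consumed = new AtomicLong(); // written by consumer only

    public SpscRing(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(T item) {                  // producer thread
        long p = produced.get();
        if (p - consumed.get() == slots.length) return false; // ring is full
        slots[(int) (p & mask)] = item;
        produced.lazySet(p + 1);                    // publish the slot
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {                               // consumer thread
        long c = consumed.get();
        if (c == produced.get()) return null;       // ring is empty
        T item = (T) slots[(int) (c & mask)];
        consumed.lazySet(c + 1);                    // slot becomes reusable
        return item;
    }
}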

Rajiv Kurian

Oct 7, 2013, 10:32:21 AM
to mechanica...@googlegroups.com
Thanks Martin. Preallocation works really well if the sizes are well known and uniform. What if the application is a server processing non-uniform data like images/videos, which can vary vastly in size?

Martin Thompson

Oct 7, 2013, 10:35:25 AM
to mechanica...@googlegroups.com
It can be a circular buffer of bytes, augmented by a circular index describing records and their respective sizes.

Peter Lawrey

Oct 7, 2013, 10:37:02 AM
to mechanica...@googlegroups.com

Variable length isn't a problem if you are always allocating. The problem arises from fragmentation when de-allocating and allocating. (Fixed-size records don't have the same fragmentation issue.)


Rajiv Kurian

Oct 7, 2013, 11:23:18 AM
to mechanica...@googlegroups.com
What I mean is: say I use one ring buffer plus indices into it to allocate memory chunks, and another ring buffer to actually send these chunks over to another thread. The second ring buffer's structs just contain the indices into the first one, along with some other data. The natural way to recycle memory is that when the second ring buffer's entries (the one used for inter-thread communication) get recycled, I claim those indices back and reuse them. But if I, say, allocate a 900 MB chunk, my worker thread could be done with it in a second, yet I would only be able to reuse it after the entire ring buffer recycles. This could cause me to run out of memory even though I have lots of free memory. If I have another data structure where the worker thread writes which indices it is done with, then maybe I can consult that every time I run out of memory on the first thread and reclaim those indices.

Peter Lawrey

Oct 7, 2013, 11:31:30 AM
to mechanica...@googlegroups.com

If you have a ring buffer, you can free the data for each message after you process it. I.e. you maintain a write pointer and a free-bytes pointer; when you add a message you can use from the write pointer to the free-bytes pointer (using clock arithmetic).

Peter.


Rajiv Kurian

Oct 7, 2013, 1:00:12 PM
to mechanica...@googlegroups.com
Thanks Peter and Martin. I get the part about maintaining a circular buffer with a write pointer and a free-bytes pointer, or an index of free ranges. Curious about what freeing entails, though. Say the worker thread processes a message and is done, so it doesn't need a particular range of bytes any more. How would you guys suggest signalling to the producer thread that a particular range is no longer needed, so that it can be reused?

My initial thought is that the worker maintains an index of ranges that it doesn't need any more. When the producer thread runs out of space, or maybe every now and then, it checks this index (always written by the worker thread and only read by the producer thread) to see what ranges it can reuse. It can reclaim these ranges at such a time.


Martin Thompson

Oct 7, 2013, 1:05:22 PM
to mechanica...@googlegroups.com
The consumer/worker has a counter indicating its position in the circular buffer, which is incremented by the size of the item that has been processed/consumed. So the producer has a counter saying how many bytes have been produced, and the consumer has a corresponding counter saying how many bytes have been consumed.
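
Roughly, in code: a simplified single-producer/single-consumer sketch with made-up names, which elides length-prefixing of records and the read path:

import java.util.concurrent.atomic.AtomicLong;

// Variable-length records in a circular byte buffer. Free space is implied
// by the difference between the two counters; "freeing" a record is just
// the consumer advancing consumedBytes by the record's size.
public final class ByteRing {
    private final byte[] buffer;
    private final int mask; // capacity must be a power of two
    private final AtomicLong producedBytes = new AtomicLong();
    private final AtomicLong consumedBytes = new AtomicLong();

    public ByteRing(int capacityPowerOfTwo) {
        buffer = new byte[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer: claim space only if enough bytes have been consumed.
    public boolean write(byte[] record) {
        long p = producedBytes.get();
        if (p + record.length - consumedBytes.get() > buffer.length) return false;
        for (int i = 0; i < record.length; i++) {
            buffer[(int) ((p + i) & mask)] = record[i];
        }
        producedBytes.lazySet(p + record.length); // publish the record
        return true;
    }

    // Consumer: advancing the counter is what hands space back.
    public void markConsumed(int recordSize) {
        consumedBytes.lazySet(consumedBytes.get() + recordSize);
    }
}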

Rajiv Kurian

Oct 7, 2013, 1:23:36 PM
to mechanica...@googlegroups.com
Thanks Martin. That actually works a lot better.

awei...@voltdb.com

Oct 8, 2013, 9:10:56 AM
to mechanica...@googlegroups.com
One technique that I think hasn't been brought up yet is simple segregated storage. To make it thread-safe you can use thread-local storage pools, and when you go to release from a different thread you can have the allocator check and batch up a few items for return to the allocating thread in a single message. The allocating thread can check for messages inside the allocator on the next malloc/free. This solves the issue of using many gigabytes of memory and complex object graphs across threads with predictable allocation time and overhead. The only thing it doesn't solve is fragmentation, which it is subject to if the distribution of allocations changes over time.
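
A rough Java-flavoured sketch of that scheme, with a single size class and made-up names; real segregated storage would have many size classes and would also flush partial batches periodically:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// One instance per owning thread. The owner allocates and frees against a
// plain, uncontended pool; other threads batch their frees and post them
// to the owner's mailbox, which the owner drains on its next call.
public final class SegregatedPool {
    private static final int CHUNK_SIZE = 64; // one size class, for brevity
    private static final int BATCH = 32;      // cross-thread frees per message

    private final ArrayDeque<byte[]> pool = new ArrayDeque<>(); // owner thread only
    private final ConcurrentLinkedQueue<List<byte[]>> mailbox =
            new ConcurrentLinkedQueue<>();
    private final ThreadLocal<List<byte[]>> pendingRemoteFrees =
            ThreadLocal.withInitial(() -> new ArrayList<byte[]>());

    public byte[] allocate() {                  // owner thread only
        drainMailbox();
        byte[] chunk = pool.pollFirst();
        return chunk != null ? chunk : new byte[CHUNK_SIZE];
    }

    public void free(byte[] chunk) {            // owner thread only
        drainMailbox();
        pool.addFirst(chunk);
    }

    public void freeFromOtherThread(byte[] chunk) {
        List<byte[]> batch = pendingRemoteFrees.get();
        batch.add(chunk);
        if (batch.size() >= BATCH) {            // one message per BATCH frees
            mailbox.add(batch);
            pendingRemoteFrees.set(new ArrayList<byte[]>());
        }
    }

    private void drainMailbox() {
        List<byte[]> batch;
        while ((batch = mailbox.poll()) != null) {
            pool.addAll(batch);
        }
    }
}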

I think stack allocation is still useful because you get locality within a unit of work and the per-object overhead is lower on the free side.

Ariel