A good off-heap library/framework? Preferably Unsafe ;)


cosmicvisitors (In my code)

Feb 20, 2015, 4:45:46 PM2/20/15
Dear mechanical-sympathisers! 

I'm hitting some GC limits in my app; popularity has scaled it to the point where there are just too many objects being created, causing a linear slowdown as the good-ol' tracing algorithm runs over the old gen. 
However, I think there is an easy way out: my application uses a lot of caching, and up to this point it's all vanilla HashMaps and LinkedHashMaps. 

So I was thinking of refactoring much of this, using off-heap techniques.
I can probably write my own, adapting excellent articles such as Martin's (http://mechanical-sympathy.blogspot.co.uk/2012/10/compact-off-heap-structurestuples-in.html) and Nitsan's (http://psy-lob-saw.blogspot.co.uk/2013/04/lock-free-ipc-queue.html), but was wondering if there is some framework/library out there packaging a tested solution in a nice open-source form, hopefully leniently licensed. Why reinvent the wheel if there is a more bug-free solution out there? 

Ideally this would use Unsafe, or even better, give you an option of using Unsafe vs ByteBuffers vs something else, for compatibility's sake.
But to be honest, I don't see myself using any other VM than Oracle, so this is really just a nice to have. 

And if I'm really pushing my luck: peer-reviewed benchmarks, a good simple or java.util.Map API, or a variety of algorithms (linked maps, hash maps, etc.) would be a definite bonus!
But really, anything that would be a good foundation to build upon is also welcome.
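To illustrate the kind of thing I mean, here is a toy sketch of my own (all names hypothetical, no real library implied): a fixed-capacity long-to-long map stored in a direct ByteBuffer with linear probing, so the entries never touch the GC-managed heap.

```java
import java.nio.ByteBuffer;

// Toy off-heap long->long hash map: open addressing over a direct ByteBuffer.
// Fixed capacity, no resize, no remove -- purely illustrative.
public class OffHeapLongMap {
    private static final int SLOT = 17;   // 1 state byte + 8-byte key + 8-byte value
    private final ByteBuffer buf;         // direct buffer: storage is off-heap
    private final int slots;

    public OffHeapLongMap(int slots) {
        this.slots = slots;
        this.buf = ByteBuffer.allocateDirect(slots * SLOT);  // zero-filled
    }

    private int home(long key) {
        return (int) (((key ^ (key >>> 32)) & 0x7fffffffL) % slots);
    }

    public void put(long key, long value) {
        int i = home(key);
        for (int probes = 0; probes < slots; probes++) {
            int off = i * SLOT;
            if (buf.get(off) == 0 || buf.getLong(off + 1) == key) {
                buf.put(off, (byte) 1);          // mark slot occupied
                buf.putLong(off + 1, key);
                buf.putLong(off + 9, value);
                return;
            }
            i = (i + 1) % slots;                 // linear probing
        }
        throw new IllegalStateException("map full");
    }

    public Long get(long key) {
        int i = home(key);
        for (int probes = 0; probes < slots; probes++) {
            int off = i * SLOT;
            if (buf.get(off) == 0) return null;  // hit an empty slot: absent
            if (buf.getLong(off + 1) == key) return buf.getLong(off + 9);
            i = (i + 1) % slots;
        }
        return null;
    }

    public static void main(String[] args) {
        OffHeapLongMap m = new OffHeapLongMap(1 << 16);
        m.put(42, 7);
        System.out.println(m.get(42));   // prints 7
    }
}
```

A real library would of course need resizing, removal, concurrency, and serialization for non-primitive values, which is exactly why I'd rather not write it myself.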

Many thanks!

Martin Thompson

Feb 20, 2015, 5:02:41 PM2/20/15
to mechanica...@googlegroups.com
Probably worth a play with some of Richard Warburton's and Peter Lawrey's work.



Martin...



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gil Tene

Feb 20, 2015, 10:46:20 PM2/20/15
Not that efficient off-heap HashMap alternatives are wrong, but I'd double-check your problem first:

When you say "...there are just too many objects created, creating a linear slowdown with the good-ol' tracing algorithm running over the old gen", do you have numbers to show that?

If your only problem is that the tracing algorithm for the OldGen marker is taking more CPU, the simplest solution is to increase the OldGen size without growing the amount of heap you actually use: for each doubling of the OldGen size, the efficiency of the tracing algorithm will roughly double.

A thrashing OldGen marker is most often a symptom of trying to use too much of a small heap, and growing your heap size is the easiest way to avoid that thrashing.

On the other hand, if your problem is not slowdown due to the OldGen collector taking too much CPU, but pauses (which tend to grow linearly with heap size or live set size on pausing OldGen collectors), then taking things off the heap, or using a collector that doesn't pause no matter what the heap size is, are your two main options.

Jan van Oort

Feb 21, 2015, 2:14:27 AM2/21/15
to mechanical-sympathy
Peter Lawrey's ChronicleMap. 



Fortuna audaces adiuvat - hos solos ? 


cosmicvisitors (In my code)

Feb 21, 2015, 6:01:48 PM2/21/15
to mechanica...@googlegroups.com
Hi,

Excuse my ignorance, but I don't understand how increasing the old gen size would increase the GC's throughput and thereby halve its workload, as you mention.

We would have the same number of GC threads, and perhaps a somewhat skewed mark initiating trigger (I use a percentage ratio for initiation, disabling other heuristics).

So where would the improvement come from?

Thanks!

Rick Hightower

Feb 22, 2015, 4:32:33 AM2/22/15
to mechanica...@googlegroups.com
I lack the confidence to say what Gil said, but that is what I was thinking.

In some recent experience with an in-memory service data store using the G1 garbage collector (heap size only 30 GB, so not HUGE), we did not see very much GC overhead, even under heavy load. (We update in periodic batches, so maybe not exactly like your use case, and most of the data does not update. The communication tier gets collected really nicely by the generational collector: most of our communication objects do die young and the JVM did the right thing, while the rest stays around a long time, so again the generational goodness was working well.)

Later we switched to off-heap using byte buffers (not Unsafe, though I was tempted, as we used it quite a bit for JSON serializing) and we did not see that much difference. It wasn't even premature optimization: we had no problem to solve per se, but just decided to post-maturely optimize it anyway because... we could. Well, that is not entirely true; we started storing UTF-8 JSON opaque values in byte arrays instead of Strings and figured WTH, might as well put them in off-heap buffers. Anyway, I saw no real difference in GC overhead, but that was our use case.

I saw some collection libs a few years ago that had a Map and whatnot done in both byte-buffer and Unsafe flavours, but could not re-find it. It was a project on Google Code, back when there were a lot of projects on Google Code (so before GitHub became the beast that it is).




--
Rick Hightower
(415) 968-9037
Profile 

cosmicvisitors (In my code)

Feb 22, 2015, 5:20:25 AM2/22/15
to mechanica...@googlegroups.com
Hi Gil

I'm not sure I understand how doubling the old gen size would halve the marking overhead. Could you elaborate?

As far as I understand the GC mechanics (using ParNew and CMS) I would still have the same number of GC threads, and probably observe a skew in the initiating occupancy trigger (I use a percentage trigger only, no heuristics).

Is it that, thanks to the extra size, the GC threads have more time to process the graph?

Thanks!

Vitaly Davidovich

Feb 22, 2015, 11:51:50 AM2/22/15
to mechanica...@googlegroups.com

You'd probably want to reduce the initiating occupancy threshold if you doubled the old gen.  But yes, the idea is that your (larger) old gen would provide a bigger buffer for holding promoted (but not eternal) objects, and the lower initiating threshold may help the concurrent GC threads keep up with your promotion rate.  This assumes you're not hitting full gc due to fragmentation.

sent from my phone

Gil Tene

Feb 22, 2015, 1:23:46 PM2/22/15
Here is how the math works for this:

To start, let me highlight the fact that GC overhead (or GC related slowdown) and GC pauses are two very different (and completely orthogonal) things. People often confuse them, and I often hear people say they want to reduce GC overhead/slowdowns when they really want to reduce GC pauses. But assuming what you are looking to do is reduce the slowdown in throughput caused by GC tracing work (per the original question), the math is pretty simple:

GC overhead or GC related slowdown is all about the amount of work GC does as a percentage of overall work that your program does. It's easiest to measure as the % of CPU cycles spent on GC [out of the overall cycles spent by the process]. With that in mind here are some simple statements that are true for all GC tracers I know of:

- Tracing work is linear to the live set size ("LiveSetSize" in the math below) [*1].

- The frequency with which tracing needs to be done is determined solely by how often the generation in question gets filled with objects to the point where a GC cycle needs to be kicked off. E.g. for OldGen, the fill rate is the promotion rate [*2], and the frequency with which tracing needs to happen is therefore linear to (PromotionRate / (OldGenSize - LiveSetSize)). [*3]

- The overall work per unit of time done by tracing is therefore linear to ( LiveSetSize * PromotionRate / (OldGenSize - LiveSetSize) ).

- For a given workload (which determines a given LiveSetSize and PromotionRate), the amount of work done by the tracer is inversely proportional to (OldGenSize - LiveSetSize).

A simpler way to state this is: The live set size determines how much work is involved in each tracing operation. But the amount of empty heap determines the frequency with which that amount of work needs to occur. For every doubling of the empty heap in oldgen (OldGenSize - LiveSetSize), the amount of tracing work per unit of time will be cut in half.
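To make that arithmetic concrete, here is a tiny model of the relationship above; all the figures (4 GB live set, 0.5 GB/s promotion rate) are invented for illustration.

```java
// Back-of-envelope model of the tracing-work relationship:
//   work per unit time  ∝  liveSet * promotionRate / (oldGenSize - liveSet)
public class TracingWorkModel {
    // Relative tracing work per unit of time (arbitrary units).
    static double tracingWork(double liveSetGB, double promotionGBps,
                              double oldGenGB) {
        return liveSetGB * promotionGBps / (oldGenGB - liveSetGB);
    }

    public static void main(String[] args) {
        double small = tracingWork(4, 0.5, 8);   // 4 GB of empty old gen
        double big   = tracingWork(4, 0.5, 12);  // 8 GB empty: headroom doubled
        // Doubling the empty heap halves the tracing work per unit time.
        System.out.println(small / big);         // prints 2.0
    }
}
```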

Note that this is a statement about the tracing part of the garbage collection cycle. The entire GC cycle work is linear to live set in Copying and pure Mark/Compact collectors [e.g. evacuating compacting collectors where the compaction work is also linear only to the live set, and is unaffected by heap size], but in Mark/Sweep/Compact collectors (like ParallelGC and CMS's FullGC cycles) the GC cycle work does grow with the heap size as well (for the sweep parts). Even so, the efficiency of the collector (in terms of % CPU cycles spent on GC) will grow with the heap size, even if not in a perfectly linear way.

This may all make you ask the obvious question: "If the math is so beneficial, why not just throw lots of cheap empty memory at the problem?". The answer is that for GC efficiency math, empty memory solves everything. But when collectors pause for amounts of time related to the size of the heap, the gain in efficiency comes at the cost of increased [individual] pause times. Since most applications have some [fuzzy but absolute] limits on tolerable pause times, this "throw empty heap at the problem" approach doesn't really work above a certain point for pausing collectors. E.g. keeping an actively modified 1GB live set LRU cache in a 300GB heap may be ultra efficient, but if it occasionally pauses for a debilitating amount of time (large enough to be better named a "crash" rather than a "pause" in many cases), it won't be deployed that way no matter how efficient things are.

The benefit for collectors that do not grow either the GC work done per cycle OR the pause time magnitudes as the empty heap grows (e.g. C4) is that you get to use this simple efficiency math all you want. E.g. most Zing users "tune" the collector (by simply setting the heap size) such that [the concurrent] GC is active for no more than 1-5% of overall wall clock time, and stop there. They could obviously go farther by doubling their heap size again and again, but it appears that at that efficiency level nobody cares enough about the background CPU cycles spent to spend any more memory to reduce them...

[*1]: Tracing work for all STW tracers and for some concurrent tracers [e.g. C4's] is linear only to the live set. But in some concurrent tracers [e.g. CMS and G1] the work also grows with the reference mutation rate, which is generally linear to application throughput. I ignore this throughput-related work contribution in the math for above discussion.

[*2]: To be precise, when I use the term "promotion rate" I include any allocation into the OldGen. In collectors where some larger object arrays may be directly allocated in the OldGen [which is true for most HotSpot collectors], "promotion rate" includes the allocation rate of such large objects. [This is not the case for all collectors. E.g. for C4, where any size object would still remain in newgen until promoted]. In any case, this does not change the math above, just clarifies the terminology.

[*3]: Occupancy [where it is used] threshold is simply a constant multiple on "OldGenSize" in this formula but it doesn't really affect the basic relationship or efficiency math beyond that constant multiple relationship. Occupancy threshold is used only by some mostly-concurrent tracers (e.g. CMS and G1). Others (e.g. C4) don't apply a constant multiple to the heap size, and instead use triggering math based on {allocation-rate, heap-size, GC-cycle-times} that aim to kick-off of a GC cycle such that the heap would become much closer to "full" before GC frees up stuff.

osu...@gmail.com

Feb 22, 2015, 3:18:25 PM2/22/15
to mechanica...@googlegroups.com
Surprisingly, there are few well-packaged off-heap libs out there; try MapDB perhaps:
http://www.mapdb.org/

Scott Carey

Feb 23, 2015, 12:25:28 PM2/23/15
to mechanica...@googlegroups.com


On Friday, February 20, 2015 at 11:14:27 PM UTC-8, Jan van Oort wrote:
Peter Lawrey's ChronicleMap. 


I have personally had better luck with MapDB. With my application (heavily read-dominated), Chronicle performed the same but had all sorts of sizing problems, leading to annoying runtime exceptions when various parameters were not pre-configured well. Beyond the basic map use case, these libraries have many disjoint features, so depending on the use case they may not compete at all. And others seem to have had a better experience with Chronicle than I have.

MapDB is roughly equal in performance to a j.u.TreeMap, except off-heap, thread-safe, and persisted to disk if needed.

Neither of these explicitly answers the OP's question -- I know of much code that uses Unsafe, often with the ability to fall back to direct byte buffers or on-heap, but no general-purpose helper library for it.
 

Rick Hightower

Feb 23, 2015, 2:11:20 PM2/23/15
to mechanica...@googlegroups.com
I have had good luck with this:


(I wrote it, but not the part that is cool.)

It used to wrap RocksDB, LMDB and/or LevelDB.

Now it just works with LevelDB, since for a while it was hard to get good JNI / Maven POM support for RocksDB and LMDB (but RocksDB will be back the very next time I need it).

The part that is cool, in my mind, is the LevelDB/Java wrapper (and RocksDB/Java, when I redo it).

LevelDB allows for a large in-memory off Java heap cache backed by disk. 

It seems (to me) like a cheap and dirty way to get off heap goodness with a battle tested C lib. 

I have used it at scale. Sharded a service store with each shard running an in-memory store.

I wonder how MapDB would compare against Java/LevelDB or Java/RocksDB. 








Michael Barker

Feb 23, 2015, 2:47:22 PM2/23/15
to mechanica...@googlegroups.com
We tried LevelDB for a project that has a high volume of writes and it fell apart, mainly due to the well-publicised stalling behaviours seen in LevelDB. We (at the moment) have a custom log-based store, which works well, but are looking at replacing it with LMDB, which doesn't suffer from the same compacting/stalling problems we've seen with LevelDB.

Mike. 

Rick Hightower

Feb 23, 2015, 3:04:38 PM2/23/15
to mechanica...@googlegroups.com
The writes were written in large batches. And we had five server shards running LevelDB, writing to SSDs. We had a secondary in-memory representation of the data and a large LevelDB read/write cache. (It was over-engineered, but failure was not an option, there being a private back wall with blood stains at about head level where the last engineers failed to deliver at scale.)

We were writing in large volumes. But I guess that depends on what you consider large. I think peak load was 400K writes sharded/spread across five servers. Hmmm... Since we had another transaction log, we tended not to write every change... So it was probably more like 100K writes spread across five servers (we load tested with three servers at that scale). We did not see an issue at this volume.

We tested the hell out of it. And tweaked things quite a bit to get it to perform. 


For this use case, LevelDB seemed to do well. We replaced a commercial product because we could not get it to stream data to the service tier fast enough. (We considered bringing in their engineers to help us use their product, but past experience with bringing in the software vendor to help us out of a tight jam usually ended up with a) not getting out of the tight jam, and b) a lot of additional consulting bills, software licenses, and tons of meetings.) The schedule was too tight. 


I described the general concept / flow of the use case here:



I wish I could say it was a great solution. It did well for the first few releases, but there were some mistakes made (accumulation of tech debt). It was a glorious thing, with a damn near flawless release and some awesome uptime under load, until the business decided maybe this is not what we want after all. The project took on a lot of tech debt, mostly due to lack of resources, and one can only miss so many hours of sleep.

If the project gets renewed next year (some doubts), we will probably switch to RocksDB, not because we had the issue you are describing with LevelDB, but because we have heard of it too and would like to avoid becoming a red stain on the back wall. 


Vitaly Davidovich

Feb 24, 2015, 9:36:35 AM2/24/15
to mechanica...@googlegroups.com

It will increase your application's throughput, not the GC's. The assumption is that tracing work is proportional only to the live set size, and not to the total size of a generation. If you double the old gen but keep the promotion rate and live set size the same, we know (based on the earlier sentence) that GC, when it kicks in, will not have to spend more time than before (because the live set size is the same). By doubling the old gen, then, you allow more (ultimately dead) objects to be promoted before GC kicks in; in fact, you double the number allowed. This means GC runs half as frequently, yet each run spends the same amount of time as before (again, because the live set size is the same).
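As a toy calculation of the frequency effect (all numbers invented): the old gen fills once (oldGenSize - liveSetSize) of promoted garbage accumulates, so doubling the headroom doubles the time between cycles, while each cycle still traces the same live set.

```java
// Toy model: time between old-gen GC cycles, assuming the old gen fills
// once its headroom (oldGen - liveSet) has been consumed by promotion.
public class CycleFrequency {
    static double secondsBetweenCycles(double oldGenGB, double liveSetGB,
                                       double promotionGBps) {
        return (oldGenGB - liveSetGB) / promotionGBps;
    }

    public static void main(String[] args) {
        // 2 GB live set, 100 MB/s promotion rate (made-up figures)
        System.out.println(secondsBetweenCycles(4, 2, 0.1));  // prints 20.0
        System.out.println(secondsBetweenCycles(6, 2, 0.1));  // prints 40.0
        // Each cycle still traces the same 2 GB live set, so with the
        // larger old gen the total GC CPU per unit time is halved.
    }
}
```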

sent from my phone

David Hagar

Feb 24, 2015, 1:21:41 PM2/24/15
to mechanica...@googlegroups.com
An option for a good off-heap library is slice: 


From the team working on Facebook's Presto project. 

-David