Running Voldemort on Extra Large Heap (32 GB to 40 GB)


Feng Wu

Jan 22, 2010, 3:59:57 PM1/22/10
to project-...@googlegroups.com
Hi,

Right now at my place we've built a back-end service application using Voldemort as storage instead of an RDBMS.  We are currently using a 10 GB heap for Voldemort, with a 4 GB BDB cache.  The rest of the memory settings are pretty much straight from Project Voldemort's configuration page.  The challenge we are facing right now, as far as I can see, is that whenever BDB hits disk, it causes Voldemort to slow down, and that in turn causes our clients to time out, as we need to maintain a certain SLA.  I've seen raw BDB store operations in the hundreds of milliseconds, and occasionally close to 1 second.

Another part of the I/O contention comes from the fact that we are running Voldemort alongside our application.  On each box we have two instances of our application with one Voldemort node.  Our application does pretty heavy data logging to local disk, where the BDB files also reside.

My thinking is to run Voldemort with much more RAM, say a 32 GB heap, with ~12 GB of BDB cache.  The idea is to keep all data in memory so BDB doesn't need to hit disk often.  Eventually I am thinking of moving Voldemort to its own dedicated cluster, separate from the application it supports; this way I am hoping to be able to independently scale either my application or Voldemort.

My questions are: 1) Are those reasonable ideas?  and 2) Do you have any experience running such a large heap, and what kind of GC tuning is needed to ensure we don't get into GC-pause death?

In production we have a 10-node Voldemort cluster, with a replication factor of 4, required R/W of 2, and preferred R/W of 3.  We are currently on version 0.51.  We are evaluating 0.6.0.1 right now, but I am hoping to move to 0.7 as soon as it comes out.
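
(For reference, replication factor and required/preferred read-write counts are configured per store in Voldemort's stores.xml. A sketch of what such a store definition looks like; the store name and serializer types here are made up, only the quorum settings mirror the cluster described above:)

```xml
<stores>
  <store>
    <!-- hypothetical store; only the quorum settings match the numbers above -->
    <name>example-store</name>
    <persistence>bdb</persistence>
    <routing>client</routing>
    <replication-factor>4</replication-factor>
    <required-reads>2</required-reads>
    <required-writes>2</required-writes>
    <preferred-reads>3</preferred-reads>
    <preferred-writes>3</preferred-writes>
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>string</type>
    </value-serializer>
  </store>
</stores>
```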

Thanks.

-Feng

bhupesh bansal

Jan 22, 2010, 4:40:31 PM1/22/10
to project-...@googlegroups.com
Hey Feng,

We are indeed running JVMs of about 18 GB with 10 GB for the BDB cache in production. The BDB disk files are about 120 GB (varying from 108 GB to 135 GB due to lazy cleaning in BDB).

The performance we have seen under peak load is:
QPS (node) = 1450, QPS (cluster) = 4300
Get timings: 1 ms, 36 ms, 70 ms (at 50%, 95%, 99%)
Put timings: 0.5 ms, 1 ms, 2 ms

The GC settings we use are the same as on the Voldemort configuration page, except for the JVM size:
http://project-voldemort.com/configuration.php

Best
Bhupesh
 

--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To post to this group, send email to project-...@googlegroups.com.
To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.

Feng Wu

Jan 22, 2010, 5:18:12 PM1/22/10
to project-voldemort
Hi, Bhupesh,

Thanks for the stats.  Just curious, what is your cluster size, and
what are the replication factor and required/preferred R/W settings?

The GET and PUT timings, are those raw BDB operation numbers?  Are
those numbers exposed somewhere via an MBean?  Do I have to turn on
Voldemort server-side debugging to get these raw performance numbers first?
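
(For anyone curious how such numbers are typically pulled: Voldemort registers its stats as JMX MBeans, so one way to find them is to list the registered MBeans on the server JVM, here via the local platform MBean server. The same view is available remotely through a JMX connector, e.g. with JConsole. A minimal sketch; which Voldemort-specific domains appear is an assumption, not confirmed in this thread:)

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ListMBeans {
    public static void main(String[] args) throws Exception {
        // The platform MBean server is where JMX-enabled apps register their beans.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : server.queryNames(null, null)) {
            // On a Voldemort node, store/stats-related domains would show up here.
            System.out.println(name);
        }
    }
}
```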

Finally, what do you think of the idea of running Voldemort on its own
cluster, instead of alongside our application on the same boxes?  I'd
imagine it's probably the norm to run applications and Voldemort on
separate machines, i.e., what we did is an exception rather than the
norm.

Thank you very much.

-Feng


Tatu Saloranta

Jan 22, 2010, 10:51:15 PM1/22/10
to project-...@googlegroups.com
On Fri, Jan 22, 2010 at 12:59 PM, Feng Wu <feng...@gmail.com> wrote:
> Hi,
>
...

> My thinking is to run Voldemort with much more RAM, say 32 GB Heap, with ~12
> GB of BDB cache.  The idea is trying to keep all data in memory so BDB don't
> need to hit disk often.  And  eventually I am thinking move Voldemort to its
> own dedicated cluster, separate from our application that it supports, this
> way I am hoping to be able to independently scale either my application or
> Voldemort.

With that much free memory for the underlying OS, doesn't it cache access
already? (True for Linux; I'd assume the same for most Unixes.)
Using JVM memory for caching can be rather expensive (from a GC perspective).
The OS disk cache should remove or reduce I/O, even though system calls are
still needed.

Also, "native" BDB (not the JE one) manages its cache outside of the JVM, so
theoretically it might be less prone to death-by-GC. That assumes it does
not suffer from poor concurrency; JE is said to be superior for
high-concurrency cases, and Voldemort would seem like one if the read/write
ratio is not huge.

>
> My questions are 1) Are those reasonable ideas?  and 2) Do you have any
> experience running such a large heap, what kind of GV tuning are needed to
> ensure we don't get into GC pause death?

:-)

This is indeed problematic. Perhaps the latest and greatest G1
("garbage first") collector could help. You probably want a concurrent
collector, to try to avoid old-generation GC at almost any cost. But that is
easier said than done; sometimes it seems to be a choice between "big crash
every couple of hours" and "death by a thousand cuts".
I had enough problems with a 1 GB cache (for a message queue engine)
that I would be a bit scared of huge caches.

-+ Tatu +-

Alex Feinberg

Jan 23, 2010, 1:50:22 AM1/23/10
to project-...@googlegroups.com
Hey Feng,

While I am not familiar with the specifics of your application, it's
usually a good idea to separate the application from the storage system.
At the least, since the memory access patterns are going to be fairly
different between the storage system and the application (lots of
transient objects in the application, lots of permanent objects in the
storage system; the application being, generally, CPU intensive, the
storage system being memory and I/O intensive), it makes more sense to
run them in different containers/JVMs. This way there's added fault
tolerance: if a JVM running the storage system experiences a
pause/crash on one machine, the application on that machine still
remains available and can merely query the storage system on another
machine, and vice versa.

Here are the full settings that we're using in production with a 10 GB
BDB cache and an 18 GB JVM heap. Note that these didn't take a
particularly lengthy amount of tuning, and we do not experience garbage
collection issues with Voldemort:

-Xms18g
-Xmx18g
-XX:NewSize=2048m
-XX:MaxNewSize=2048m
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:SurvivorRatio=2
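
A quick back-of-envelope on those numbers (my arithmetic, not from the post): with an 18 GB heap and a 2 GB young generation, the old generation is about 16 GB, and CMSInitiatingOccupancyFraction=70 starts concurrent collection at roughly 11.2 GB of old-gen occupancy, comfortably above the 10 GB BDB cache:

```java
public class HeapSizing {
    public static void main(String[] args) {
        double heapGb = 18.0;              // -Xms18g / -Xmx18g
        double newGb = 2048 / 1024.0;      // -XX:NewSize / -XX:MaxNewSize = 2048m
        double oldGb = heapGb - newGb;     // tenured (old) generation
        double cmsStartGb = oldGb * 0.70;  // -XX:CMSInitiatingOccupancyFraction=70
        System.out.printf("old gen: %.1f GB, CMS starts at ~%.1f GB%n",
                oldGb, cmsStartGb);
        // With a 10 GB BDB cache resident in the old gen, there is ~1.2 GB of
        // headroom before CMS cycles begin, so steady-state caching alone
        // should not keep the collector constantly busy.
    }
}
```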

We have multiple Voldemort clusters in production for both read-only
and read-write data, several per team, with different replication
factors. The cluster I pasted the settings for has ~6 nodes. We're
presently in the middle of rolling out rebalancing and expanding this
specific cluster, to improve the ratio of RAM to disk.

Note, however, as Tatu suggested, that the operating system page cache
can be *very* efficient. We haven't yet tested the possibility that a
large BDB cache may, in a way, be competing with the operating system
page cache in *some* scenarios.

Thanks,
- Alex


ijuma

Jan 23, 2010, 2:24:58 AM1/23/10
to project-voldemort
On Jan 22, 8:59 pm, Feng Wu <fengw...@gmail.com> wrote:
> My thinking is to run Voldemort with much more RAM, say 32 GB Heap, with ~12
> GB of BDB cache.

One thing you can try is -XX:+UseCompressedOops with JDK6u18
(don't try it on older releases), as it can help you fit more in the
same amount of memory (particularly if the application is
reference-heavy). It basically changes the size of a reference in the
Java heap to 32 bits in a 64-bit JVM. The maximum heap size where that
option works is 32 GB. I wrote a bit about it a while ago:

http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

More recently, I included that option in a set of micro-benchmarks and
it had a very positive effect on performance too:

http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-performance/
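
(One way to see the effect on your own JVM, a sketch using the unsupported sun.misc.Unsafe API, so treat it as a diagnostic hack rather than production code: the index scale of an Object[] equals the in-heap reference size, 4 bytes when compressed oops are active and 8 bytes on a plain 64-bit heap:)

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RefSize {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection (unsupported API, debug use only).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        // Distance between consecutive Object[] slots = size of a heap reference:
        // 4 bytes with -XX:+UseCompressedOops, 8 bytes without it on 64-bit HotSpot.
        int refSize = unsafe.arrayIndexScale(Object[].class);
        System.out.println("heap reference size: " + refSize + " bytes");
    }
}
```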

Best,
Ismael

AshwinJay

Jan 24, 2010, 9:58:25 PM1/24/10
to project-voldemort
It's very encouraging to see people running 18Gig heaps successfully.
I've heard that the G1 collector is still slower (and buggier) than
CMS. Hopefully it's just a matter of time before G1 becomes faster and
better.

Until then, -XX:+UseLargePages and -XX:+UseTLAB will help for *nix
platforms. (More: http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp)

Ashwin.


ijuma

Jan 25, 2010, 2:30:48 AM1/25/10
to project-voldemort
On Jan 25, 2:58 am, AshwinJay <ashwin.jayaprak...@gmail.com> wrote:
> It's very encouraging to see people running 18Gig heaps successfully.
> I've heard that the G1 collector is still slower (and buggier) than
> CMS.

The people that told you this, did they test with JDK6u18 or a recent
JDK7 snapshot (e.g. build 80)? The former includes a lot of fixes
since G1 was integrated into JDK6 (update 14), and the latter even
more. In fact, they recently removed the experimental tag in the
Mercurial tree.

> Hopefully it's just a matter of time before G1 becomes faster and
> better.

We're all hoping. :) It's also possible that people need some time to
learn what switches to use with it for best performance.

Best,
Ismael

Feng Wu

Jan 25, 2010, 11:43:30 AM1/25/10
to project-...@googlegroups.com
With regard to BDB, we are using the JE version of it.

We are hoping the OS cache will help as well.  The boxes we are provisioning actually have 48 GB of RAM.

Right now we are running on a 10 GB heap with a 4 GB BDB cache.  We used the memory settings recommended by Project Voldemort's configuration page.  It is working exceedingly well, actually.  I was just concerned about going from a 10 GB heap to something much larger, say 20~30 GB.  So it is nice to see LinkedIn running Voldemort with an 18 GB heap.


Feng Wu

Jan 25, 2010, 11:51:26 AM1/25/10
to project-...@googlegroups.com
Thanks, Alex, for posting the GC/heap settings you use.  As I mentioned earlier in my reply to Tatu, we are using pretty much identical settings, except for our 10 GB heap and 4 GB BDB cache.

Just a couple of clarifications: right now, even though we run both Voldemort and our application on the same physical box, they do run in different JVMs.  The I/O contention, we believe, mostly comes from disk, as our application does heavy logging to disk.  That probably impacts BDB negatively when the data we try to read is not in its cache, or when it checkpoints and writes to disk.

Thanks.  The 18 GB heap that you are using definitely gives me some confidence as we test out large heaps for our Voldemort cluster.

-Feng

Feng Wu

Jan 25, 2010, 12:01:00 PM1/25/10
to project-...@googlegroups.com
On Sun, Jan 24, 2010 at 11:30 PM, ijuma <ism...@juma.me.uk> wrote:
> On Jan 25, 2:58 am, AshwinJay <ashwin.jayaprak...@gmail.com> wrote:
> > It's very encouraging to see people running 18Gig heaps successfully.
> > I've heard that the G1 collector is still slower (and buggier) than
> > CMS.
>
> The people that told you this, did they test with JDK6u18 or a recent
> JDK7 snapshot (e.g. build 80)? The former includes a lot of fixes
> since it was integrated into JDK6 (update 14) and the latter even
> more. In fact, they recently removed the experimental tag in the
> mercurial tree.


G1 is still experimental in 6u18.  For now, though, I feel we don't have enough experience with it (actually, we haven't had any experience with it yet) to comfortably put it into production use.  We will definitely keep an eye on it and test it out with load tests.  As of now our plan is still to go with our tried-and-true CMS, maybe experimenting with the i-CMS mode if load tests prove the existing GC settings inadequate with a larger heap.
 
> > Hopefully it's just a matter of time before G1 becomes faster and
> > better.
>
> We're all hoping. :) It's also possible that people need some time to
> learn what switches to use with it for best performance.


I agree.  I think G1 is just so new that people don't know the ins and outs of how to use/tune it yet.
 


ijuma

Jan 25, 2010, 12:12:14 PM1/25/10
to project-voldemort

On Jan 25, 5:01 pm, Feng Wu <fengw...@gmail.com> wrote:
> On Sun, Jan 24, 2010 at 11:30 PM, ijuma <ism...@juma.me.uk> wrote:
> > On Jan 25, 2:58 am, AshwinJay <ashwin.jayaprak...@gmail.com> wrote:
> > > It's very encouraging to see people running 18Gig heaps successfully.
> > > I've heard that the G1 collector is still slower (and buggier) than
> > > CMS.
>
> > The people that told you this, did they test with JDK6u18 or a recent
> > JDK7 snapshot (e.g. build 80)? The former includes a lot of fixes
> > since it was integrated into JDK6 (update 14) and the latter even
> > more. In fact, they recently removed the experimental tag in the
> > mercurial tree.
>
> G1 is still experimental on 6u18.

Yes. When I say the Mercurial tree, I mean the development tree, so it
only affects JDK7 until the next HotSpot update in JDK6; my guess is
JDK6u22 (there have been HotSpot updates in u10, u14 and u18). I would
definitely not recommend using it in production yet.

Ismael

Feng Wu

Jan 25, 2010, 12:22:16 PM1/25/10
to project-...@googlegroups.com

On Sun, Jan 24, 2010 at 6:58 PM, AshwinJay <ashwin.ja...@gmail.com> wrote:
> It's very encouraging to see people running 18Gig heaps successfully.
> I've heard that the G1 collector is still slower (and buggier) than
> CMS. Hopefully it's just a matter of time before G1 becomes faster and
> better.
>
> Until then, -XX:+UseLargePages and -XX:+UseTLAB will help for *nix
> platforms. (More: http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp)


Hmm, I have to look into the large pages option.  Our app runs on Linux boxes which already support large pages:

> cat /proc/meminfo | grep Huge
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

However, this seems to be one of those more exotic GC settings.  Are there any indicators/symptoms that would suggest using this setting, other than the fact that we are running a large heap?

I thought UseTLAB is on by default already, am I mistaken?  We are using JDK 1.6.0_13 at the moment.
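
(For what it's worth, the live value of such flags can be read at runtime through the HotSpot diagnostic MXBean; the getPlatformMXBean call shown here needs Java 7+, and on a JDK 6 process `jinfo -flag UseTLAB <pid>` gives the same answer. A sketch:)

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CheckVmFlags {
    public static void main(String[] args) {
        // HotSpot-specific bean exposing the current value of every -XX flag.
        HotSpotDiagnosticMXBean hs =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {"UseTLAB", "UseLargePages"}) {
            System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
        }
    }
}
```

On stock HotSpot, UseTLAB has indeed defaulted to true for a long time, so passing it explicitly is mostly a no-op.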

Thanks.

-Feng

AshwinJay

Jan 25, 2010, 5:06:42 PM1/25/10
to project-voldemort
Large pages - http://developer.amd.com/documentation/articles/pages/322006145.aspx.
It's an old article for AMD but still applicable.

Ashwin.
