Thanks for the stats. Just curious, what is your cluster size, and
what are the replication factor and required/preferred R/W settings?
Are the GET and PUT timings raw BDB operation numbers? Are those
numbers exposed somewhere via an MBean? Do I have to turn on Voldemort
server-side debugging to get these raw performance numbers first?
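(For what it's worth, I can enumerate whatever MBeans the server
registers with a few lines of standard JMX, along these lines; the
"voldemort*" domain pattern is just my guess, to be verified in
jconsole, and I still don't know which attributes hold the timings:)

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ListVoldemortMBeans {
    public static void main(String[] args) throws Exception {
        // Platform MBean server of the current JVM; for a remote
        // Voldemort server you would connect through a JMXConnector
        // to its JMX port instead.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Domain pattern is a guess -- check the actual names in jconsole.
        ObjectName pattern = new ObjectName("voldemort*:*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            System.out.println(name);
        }
    }
}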
Finally, what do you think of the idea of running Voldemort on its own
cluster, instead of alongside our application on the same box? I'd
imagine it's probably the norm to run applications and Voldemort on
separate machines, i.e., what we did is the exception rather than
the norm.
Thank you very much.
-Feng
On Jan 22, 1:40 pm, bhupesh bansal <bbansal....@gmail.com> wrote:
> Hey Feng,
>
> We are indeed running JVMs of about 18G, with 10G given to the BDB cache, in
> production. The BDB disk file sizes are about 120G (varying from 108G to
> 135G due to lazy cleaning in BDB).
> The performance we have seen under peak load is QPS (node) = 1450,
> QPS (cluster) = 4300, with GET timings of 1 ms, 36 ms, 70 ms (for the
> 50%, 95%, and 99% percentiles) and PUT timings of 0.5 ms, 1 ms, 2 ms.
>
> The GC settings we use are the same as on the Voldemort configuration page,
> except for the JVM size: http://project-voldemort.com/configuration.php
>
> Best
> Bhupesh
>
With that much free memory left to the underlying OS, doesn't the OS
cache disk access already? (True for Linux; I'd assume for most Unixes.)
Using JVM memory for caching can be rather expensive from a GC
perspective, whereas the OS disk cache should remove or reduce actual
I/O, even though system calls are still needed.
Also, "native" BDB (not the JE one) manages its cache outside the JVM,
so in theory it might be less prone to death-by-GC, assuming it does
not suffer from poor concurrency. JE is said to be superior for
high-concurrency cases, and Voldemort would seem to be one, if the
read/write ratio is not huge.
>
> My questions are 1) Are those reasonable ideas? and 2) Do you have any
> experience running such a large heap? What kind of GC tuning is needed to
> ensure we don't get into GC pause death?
:-)
This is indeed problematic. Perhaps the latest and greatest G1
("garbage first") collector could help. You probably want a concurrent
collector, to try to avoid old-generation GC at almost any cost. But
that is easier said than done; sometimes it seems to be a choice
between "big crash every couple of hours" and "death by a thousand cuts".
I had enough problems with a 1G cache (for a message queue engine)
that I would be a bit scared of huge caches.
-+ Tatu +-
While I am not familiar with the specifics of the application, it's
usually a good idea to separate the application from the storage
system. At the very least, the memory access patterns are going to be
fairly different between the two (lots of transient objects in the
application, lots of long-lived objects in the storage system; the
application being, generally, CPU intensive, the storage system being
memory and I/O intensive), so it makes more sense to run them in
separate containers/JVMs. There is also added fault tolerance: if the
JVM running the storage system experiences a pause or crash on one
machine, the application on that machine remains available and can
simply query the storage system on another machine, and vice versa.
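(To make the failover point concrete: a Voldemort client bootstraps
from a list of node URLs rather than binding to a co-located server,
so the application JVM can keep going if one storage JVM pauses. A
minimal sketch in the style of the project quickstart; the host names
and the "test" store are placeholders:)

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class SeparateTierExample {
    public static void main(String[] args) {
        // Bootstrap from any nodes of the dedicated storage cluster;
        // if one node pauses or crashes, requests go to another.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls(
                        "tcp://storage1:6666", "tcp://storage2:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");
        client.put("some-key", "some-value");
        System.out.println(client.getValue("some-key"));
    }
}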
Here are the full settings that we're using in production with a 10GB
BDB cache and an 18GB JVM heap. Note that these didn't take a
particularly lengthy amount of tuning, and we do not experience
garbage collection issues with Voldemort:
-Xms18g
-Xmx18g
-XX:NewSize=2048m
-XX:MaxNewSize=2048m
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:SurvivorRatio=2
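(A quick way to verify that a running JVM actually picked up this
collector combination: with -XX:+UseParNewGC and
-XX:+UseConcMarkSweepGC, the standard GC MXBeans report "ParNew" and
"ConcurrentMarkSweep". A minimal sketch:)

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class PrintGcConfig {
    public static void main(String[] args) {
        // Lists each collector the JVM selected, plus its activity so far.
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        System.out.println("maxHeapMB="
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}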
We have multiple Voldemort clusters in production for both read-only
and read-write data, several per team, with different replication
factors. The cluster I pasted the settings for has ~6 nodes. We're
presently in the middle of rolling out rebalancing and expanding this
specific cluster, to improve the ratio of RAM to disk.
Note, however, as Tatu suggested, that the operating system page cache
can be *very* efficient. We haven't yet tested whether a large BDB
cache may, in some scenarios, end up competing with the operating
system page cache.
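(For anyone tuning this trade-off: the BDB cache is sized per node in
server.properties; if I'm reading the configuration page linked
earlier correctly, the key is bdb.cache.size, with the value in bytes.
For example:)

# server.properties (per Voldemort node)
# 10GB BDB cache, matching the JVM settings above; lowering this
# leaves more RAM for the operating system page cache instead.
bdb.cache.size=10737418240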
Thanks,
- Alex
One thing you can try is -XX:+UseCompressedOops with JDK6u18 (don't
try it on older releases), as that can help you fit more into the same
amount of memory (particularly if the application is reference-heavy).
It basically shrinks references in the Java heap to 32 bits on a
64-bit JVM. The maximum heap size at which the option works is 32GB. I
wrote a bit about it a while ago:
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/
More recently, I included the option in a set of micro-benchmarks, and
it had a very positive effect on performance too:
http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-performance/
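(A crude way to see the effect on your own data: run a reference-heavy
allocation with and without -XX:+UseCompressedOops on a 64-bit JVM and
compare used heap. A minimal, self-contained sketch with arbitrary
sizes:)

import java.util.ArrayList;
import java.util.List;

public class CompressedOopsDemo {
    public static void main(String[] args) {
        // Object[] instances are dominated by reference slots, so heap
        // use should drop noticeably when compressed oops are enabled.
        List<Object[]> refs = new ArrayList<Object[]>();
        for (int i = 0; i < 200000; i++) {
            refs.add(new Object[16]); // 16 reference slots per array
        }
        System.gc(); // best-effort; good enough for a rough comparison
        long used = Runtime.getRuntime().totalMemory()
                - Runtime.getRuntime().freeMemory();
        System.out.println("arrays held: " + refs.size()
                + ", used heap MB: " + used / (1024 * 1024));
    }
}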
Best,
Ismael
It's very encouraging to see people running 18G heaps successfully.
I've heard that the G1 collector is still slower (and buggier) than
CMS. Hopefully it's just a matter of time before G1 becomes faster and
better.
Until then, -XX:+UseLargePages and -XX:+UseTLAB will help on *nix
platforms. (More: http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp)
Ashwin.
> It's very encouraging to see people running 18G heaps successfully.
> I've heard that the G1 collector is still slower (and buggier) than
> CMS.
Did the people who told you this test with JDK6u18 or a recent JDK7
snapshot (e.g. build 80)? The former includes a lot of fixes made
since G1 was integrated into JDK6 (in update 14), and the latter even
more. In fact, they recently removed the experimental tag in the
mercurial tree.
> Hopefully it's just a matter of time before G1 becomes faster and
> better.
We're all hoping. :) It's also possible that people need some time to
learn what switches to use with it for best performance.
Best,
Ismael
On Jan 25, 5:01 pm, Feng Wu <fengw...@gmail.com> wrote:
> G1 is still experimental on 6u18.
Yes. When I say mercurial tree, I mean the development tree, so the
change only affects JDK7 until the next HotSpot update lands in JDK6;
my guess is that will be JDK6u22 (there have been HotSpot updates in
u10, u14, and u18). I would definitely not recommend using G1 in
production yet.
Ismael