Voldemort performance on Amazon EC2

Yoav

Sep 1, 2010, 8:52:14 AM
to project-voldemort
Hi,

We've been testing a 4 node cluster, with the following configuration
on each of the 6 stores we manage:
<replication-factor>1</replication-factor>
<required-reads>1</required-reads>
<required-writes>1</required-writes>

Our cluster resides on Amazon EC2, on machines of type m1.large (7.5 GB
memory, 4 EC2 Compute Units: 2 virtual cores with 2 EC2 Compute Units
each).
The data on each node is on 4 EBS volumes under RAID.

We have about 35G of data on each node.

We notice that the median for query time is reasonable, but the 90th
percentile is quite high, even above the specified timeout.
Our analysis shows that when query time is high, the CPU usage of the
machine is very high (memory usage is obviously high, around 85%).
We have noticed that there's 30% iowait on the average, which makes us
believe that threads are blocking on i/o and no work gets done (and is
unlikely to get done regardless of more threads on the same machine).
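
For anyone wanting to reproduce the iowait figure, here is a minimal sketch in Python. It assumes a Linux host and the usual layout of the aggregate `cpu` line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal). The function name `iowait_percent` and the sample counters are hypothetical and are not measurements from this cluster.

```python
def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat `cpu` lines."""

    def counters(line: str) -> list:
        parts = line.split()
        if not parts or parts[0] != "cpu" or len(parts) < 6:
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Keep user..steal only; later fields (guest) are already counted in user.
        return [int(x) for x in parts[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas[4] / total  # index 4 is iowait


# Hypothetical counters sampled a few seconds apart (values are in jiffies).
sample_before = "cpu 100 0 50 800 50 0 0 0"
sample_after = "cpu 200 0 100 1350 350 0 0 0"
print(iowait_percent(sample_before, sample_after))
```

Sampling this over the periods when queries time out would show whether the iowait spikes line up with the slow requests.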

In the Voldemort log files we don't see anything unusual other than a
(relatively) large number of connects and disconnects following broken
pipes. I assume this is related to the client detecting "Insufficient
operational nodes" due to the problem, and therefore "restarting" the
connections.

The I/O load seems a bit severe. Does anyone have any experience with
Voldemort on EC2?

Thanks

Below is the BDB configuration in server.properties.

# BDB
bdb.sync.transactions=false
bdb.cache.size=2500MB
bdb.max.logfile.size=60MB
bdb.one.env.per.store=true
je.cleaner.minUtilization=25
bdb.cleaner.minUtilization=25
je.cleaner.threads=1
je.cleaner.readSize=102400
je.cleaner.lockTimeout=10000
je.checkpointer.highPriority=false
je.env.backgroundReadLimit=5
je.env.backgroundWriteLimit=5

gxm

Sep 1, 2010, 12:43:15 PM
to project-voldemort
What are your GC settings and application profile?

From my testing of our write heavy application in EC2, once a 3GB bdb
cache filled, the GC would spend significant cpu resources, especially
in the "abortable pre-clean" stage.
I found my best throughput with a 512MB bdb cache and these java
settings:

-Xmx4G -Xms4G -XX:NewSize=2G -XX:MaxNewSize=2G -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=31 -XX:CMSInitiatingOccupancyFraction=70
-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:/tmp/gc.log

Yoav Naveh

Sep 1, 2010, 12:58:47 PM
to project-...@googlegroups.com
Hi, thanks for your reply.

This is my startup command:

java -Xmx6500m -server -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$LOG_DIR/gc.log -cp $CLASSPATH 
-Dcom.sun.management.jmxremote voldemort.server.VoldemortServer ${1}

So it seems we are pretty much in line on that aspect.

Regarding the bdb cache size, we took it down to 2.5G gradually (initially we had 3.5G, and 7G for the JVM) and we did see improvement. But eventually we ran into the same issues with 2.5G as well. It seems a bit problematic for us to keep only 512MB for the cache, as this is a very small portion of our data, and it would mean numerous machines in order to keep a reasonable amount in the cache. How much data do you hold? How are you set up to handle this?


--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To post to this group, send email to project-...@googlegroups.com.
To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.


Alex Feinberg

Sep 1, 2010, 2:18:20 PM
to project-...@googlegroups.com
If you're using 6.5 GB of heap, you may want to reduce your NewSize.

- Alex

gxm

Sep 1, 2010, 3:30:59 PM
to project-voldemort
Our application is write heavy, and so we don't need a large cache
size.
The number of reads that we do is small enough that we can afford a
disk hit for reads.

The problem I ran into with a large BDB cache size, which led to high
cpu usage, was this:
13383.651: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 13388.698: [CMS-concurrent-abortable-preclean: 4.143/5.047 secs] [Times: user=3.29 sys=0.34, real=5.04 secs]
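
To see how often and for how long this phase runs, gc.log can be scanned for these entries. A minimal Python sketch, assuming the log was written with -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps in the format shown above, with each entry on a single line. The helper name `preclean_durations` is made up for illustration. The two numbers are commonly read as time the phase spent running versus wall-clock time, which is an interpretation rather than something established in this thread.

```python
import re

# Matches the end-of-phase entry, e.g.
# [CMS-concurrent-abortable-preclean: 4.143/5.047 secs]
PRECLEAN = re.compile(
    r"\[CMS-concurrent-abortable-preclean: (\d+\.\d+)/(\d+\.\d+) secs\]"
)


def preclean_durations(log_text: str) -> list:
    """Return (first, second) duration pairs, in seconds, for each preclean entry."""
    return [(float(a), float(b)) for a, b in PRECLEAN.findall(log_text)]


sample = (
    "13383.651: [CMS-concurrent-abortable-preclean-start]\n"
    " CMS: abort preclean due to time 13388.698: "
    "[CMS-concurrent-abortable-preclean: 4.143/5.047 secs] "
    "[Times: user=3.29 sys=0.34, real=5.04 secs]\n"
)
print(preclean_durations(sample))
```

Many entries close to the 5 second mark would suggest the phase is regularly being aborted on its time limit, as in the excerpt.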

What do your gc logs say?

gxm

Sep 1, 2010, 3:38:38 PM
to project-voldemort
I just reread your first post, and realized I missed this - "The data
in each node is on 4 EBS volumes under RAID"

We did all of our testing using the native disk on the m1.large
instances, which, at 800+ GB, would be more than enough to hold your
data.
I've not tested using EBS, but I would guess that being a remote
store, requiring at least one network hop, it might have worse
performance, and lead to the high iowait.

Tatu Saloranta

Sep 1, 2010, 5:17:35 PM
to project-...@googlegroups.com
On Wed, Sep 1, 2010 at 9:58 AM, Yoav Naveh <yoav...@gmail.com> wrote:
> Hi, thanks for your reply.
> This is my startup command:
> java -Xmx6500m -server -XX:NewSize=2048m -XX:MaxNewSize=2048m
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintTenuringDistribution
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$LOG_DIR/gc.log -cp
> $CLASSPATH
> -Dcom.sun.management.jmxremote voldemort.server.VoldemortServer ${1}
> So it seems we are pretty much inline on that aspect.
> Regarding bdb cache size, we took it down to 2.5G gradually (initially we
> had 3.5G and 7G for JVM) and we did see improvement. But eventually we got
> caught with the issues on the 2.5G as well. It seems a bit problematic for
> us to keep only 512MB for cache, as this is a very small portion of our data
> and that would mean numerous machines in order to keep a reasonable amount
> in the cache. How much data do you hold? How are you set up to handle this?

Keep in mind that this cache is a rather expensive in-process (in JVM
heap) cache, and it is not the only caching layer.
The OS often caches data blocks at a lower level, and does it more
efficiently than the JVM would (this is obviously a trade-off; the OS
does some things better than an explicit app-level cache manager).
As such, maximizing the in-process cache is often not a good way to go.
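
As a rough illustration of that trade-off using the numbers mentioned in this thread (7.5 GB of RAM on m1.large, -Xmx6500m versus -Xmx4G, about 35G of data per node, a bdb cache of 2500MB versus 512MB), here is a small Python sketch. It assumes 1 GB = 1024 MB and ignores JVM overhead outside the heap, so the figures are approximate. The function names are illustrative only.

```python
MB_PER_GB = 1024  # assumption: binary units


def left_for_os_mb(ram_mb: float, heap_mb: float) -> float:
    """Memory not claimed by the Java heap, available to the OS and its page cache."""
    return ram_mb - heap_mb


def cache_coverage_percent(cache_mb: float, data_mb: float) -> float:
    """Share of the on-disk data that fits in the bdb cache."""
    return 100.0 * cache_mb / data_mb


ram_mb = 7.5 * MB_PER_GB   # m1.large
data_mb = 35 * MB_PER_GB   # data per node

# (heap, bdb cache) pairs taken from the two setups described in the thread.
for heap_mb, cache_mb in [(6500, 2500), (4 * MB_PER_GB, 512)]:
    print(
        f"heap={heap_mb}MB cache={cache_mb}MB "
        f"left_for_os={left_for_os_mb(ram_mb, heap_mb):.0f}MB "
        f"coverage={cache_coverage_percent(cache_mb, data_mb):.1f}%"
    )
```

Under these assumptions the larger heap leaves a little over 1 GB outside the JVM, while the 4G heap leaves roughly 3.5 GB that the OS can use for caching blocks. Either bdb cache covers only a small fraction of 35G.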

-+ Tatu +-

Yoav Naveh

Sep 3, 2010, 4:10:16 PM
to project-...@googlegroups.com
Are you using the local ephemeral disk (/mnt) or the main one?

Anyway, here is what gc log says:
To check, I did "grep YG gc.log" on 204.236.223.190, and apparently this arena is between 0.4G and 1.8G after collection (which happens every 20-30 secs; the rate of collection is ~1GB/sec, with 5-10% of CPU allocated to GC).

Per Alex's suggestion I changed the start command to use:
 -XX:NewSize=1024m -XX:MaxNewSize=1024m

After a couple of days of running, I do not see a significant improvement.

Any thoughts?

Thanks




gxm

Sep 3, 2010, 5:17:09 PM
to project-voldemort
I used the ephemeral disk under /mnt.
You could set up both 420GB disks in RAID0, but I never got around to
that.

I also used a GC analyzer to measure GC throughput, which I eventually
got up to 99% using the settings listed above.
I used http://www.tagtraum.com/gcviewer.html , and there are many
others.
I found the visualization in gcviewer to be quite helpful.
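
For reference, the throughput figure such tools report is, roughly, the percentage of elapsed time not spent in GC pauses. A minimal Python sketch of that calculation, assuming the pause durations have already been extracted from gc.log. The function name and the sample numbers are hypothetical and are not measurements from this thread.

```python
def gc_throughput_percent(pause_seconds: list, elapsed_seconds: float) -> float:
    """Percentage of elapsed time that was not spent in the given GC pauses."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    paused = sum(pause_seconds)
    if paused > elapsed_seconds:
        raise ValueError("pauses cannot exceed elapsed time")
    return 100.0 * (1.0 - paused / elapsed_seconds)


# Hypothetical run: 1 second of pauses in total over 100 seconds.
print(gc_throughput_percent([0.5, 0.25, 0.25], 100.0))
```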

Maarten Koopmans

Sep 6, 2010, 2:04:55 PM
to project-...@googlegroups.com
I had a similar headachish problem with zookeeper last week (a
different beast, but complex distributed middleware nonetheless).
Anyway, I plugged in a trial of YourKit as profiler and saved myself
quite some time.

I haven't bought it and am not affiliated with them in any way, but it
was amazingly simple to profile when you take a few snapshots (and
should you try, watch 20 mins of videos or so to get the hang of it).

Anyway, end of unsolicited advice.

--Maarten

Yoav Naveh

Sep 6, 2010, 8:21:34 PM
to project-...@googlegroups.com
Hi,

I will try to move the files to /mnt tomorrow, see if this helps with iowait, which is high during times of problems.

Regarding profilers, could you explain a bit what you were looking for? Was it to tune the NewSize parameters or the bdb cache size? I'd appreciate some more information on your process, just to get a lead on where to start looking.

Thanks!