Systematic error in Voldemort's performance tool.


Paolo Forte

unread,
Mar 4, 2015, 5:39:49 PM3/4/15
to project-...@googlegroups.com
Hi everybody, I am trying to create a load generator and a sensor based on the built-in performance tool. The scenario is the following:
I run a server on the machine A:
bin/voldemort-server.sh config/single_node_cluster

In the same local network I run the performance tool on the machine B:
./voldemort-performance-tool.sh --interval 1 --iterations 1 --metric-type summary --num-connections-per-node 100 --percent-cached 0 -r 50 -w 50 --record-selection uniform --store-name test --target-throughput 1000 --threads 16 --url tcp://<ip-addr> --value-size 4000 --ops-count 999999

Periodically, outliers appear and the measured values go sky high. The period seems to be quite stable; I tried on different machines and I stopped all background services.
What can it be? How can it be solved?

Arunachalam

unread,
Mar 4, 2015, 5:55:19 PM3/4/15
to project-...@googlegroups.com
I have never looked at the Voldemort performance tool.

But can you do the initial investigation?

Where are you hitting the bottleneck: CPU, memory, or network? You can modify the performance tool to start with a GC log dump and see if the client GC pause times align with the peaks.

Send me what you find and we will proceed from there.
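Something like the following would turn the dump on (a sketch, not tested here: these are standard HotSpot GC-logging flags for Java 7/8-era JVMs; the heap size and log path are just examples):

```shell
# Standard HotSpot GC-logging flags (Java 7/8 era); heap size and log path
# are examples. Export before launching the script you want to profile.
export VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote \
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-Xloggc:/tmp/voldemort-gc.log"

# Then start the process as usual, e.g.:
#   bin/voldemort-server.sh config/single_node_cluster
```

Timestamps in the resulting log can then be lined up against the latency spikes.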

Thanks,
Arun.



Paolo Forte

unread,
Mar 5, 2015, 8:07:34 AM3/5/15
to project-...@googlegroups.com
So, here are the results I collected and the procedure I followed.

PREMISE: I slightly modified the performance tool so that measurements are evaluated each second (in the original, each measurement is evaluated since startup), and I parse its output through a bash script to get more readable results.

1) I start the server on machine A:  bin/voldemort-server.sh config/single_node_cluster
2) On A, I run jps in order to obtain voldemort's pid
3) On A, I run jstat -gcutil -t 60890 1s 1000. The jstat output is pretty stable so far.
4) On a machine B in the same local network, I start my performance tool.
./voldemort-performance-tool.sh --interval 1 --iterations 1 --metric-type summary --num-connections-per-node 100 --percent-cached 0 -r 50 -w 50 --record-selection uniform --store-name test --target-throughput 1000 --threads 16 --url tcp://<ip_addr>:6666 --value-size 4000 --ops-count 99999

The output of my performance tool is this: http://pastebin.com/djiVcLBw
The output of jstat is this: http://pastebin.com/gmZ9z0E2

I put the two terminals side by side and noticed that every time "FGC" increases, there is a "glitch" in the performance tool's output.
When I stopped the performance tool at the end, the jstat output became stable again.
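To spot those moments automatically, I use a small awk helper of my own (not part of Voldemort; it assumes the Java 7 `jstat -gcutil -t` column layout, where FGC is the 9th column):

```shell
# My own helper: pipe `jstat -gcutil -t <pid> 1s` into it; prints a marker
# whenever FGC (full-GC count, 9th column with -t on a Java 7 JVM)
# increases between consecutive samples.
flag_fgc() {
  awk 'NR == 2 { prev = $9 }
       NR > 2  { if ($9 > prev) print "full GC at sample " (NR - 1); prev = $9 }'
}
```

Lining its output up against the performance tool's per-second output makes the correlation explicit.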

Do you need anything else?

Arunachalam

unread,
Mar 5, 2015, 10:57:13 AM3/5/15
to project-...@googlegroups.com
When the server's GC time increases, the client notices the spike in latencies? Is that your conclusion? If so, this could be expected.

single_node_cluster runs with a very small heap footprint. Production servers typically run with 32+ GB heaps and CMS enabled to avoid this.
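For illustration, options of the kind a production setup might use (the exact sizes and thresholds vary per deployment, and these CMS flags apply to pre-Java-9 HotSpot; treat the values as placeholders, not a recommendation):

```shell
# Illustrative CMS-style production options; sizes depend on the box.
export VOLD_OPTS="-server -Xms32g -Xmx32g \
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
```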

Thanks,
Arun.

Paolo Forte

unread,
Mar 5, 2015, 11:25:54 AM3/5/15
to project-...@googlegroups.com
I tried to modify the file bin/voldemort-server.sh
In particular, I modified the lines:
if [ -z "$VOLD_OPTS" ]; then
  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
fi
increasing the number of gigabytes:
if [ -z "$VOLD_OPTS" ]; then
  VOLD_OPTS="-Xmx64G -server -Dcom.sun.management.jmxremote"
fi

but it still doesn't work.

Can you point me toward a solution to my problem?

Thanks.

Arunachalam

unread,
Mar 5, 2015, 1:02:42 PM3/5/15
to project-...@googlegroups.com
Look at bin/voldemort-prod-server.sh for a typical production server configuration. 

Thanks,
Arun.

Paolo Forte

unread,
Mar 9, 2015, 2:20:31 PM3/9/15
to project-...@googlegroups.com
Hi,
I followed your suggestion but I guess I am missing something since my problem is still there. 

I run the server launching 
voldemort/bin/voldemort-prod-server.sh voldemort/config/single_node_cluster
 
Then, I run the performance tool launching
voldemort/bin/voldemort-performance-tool.sh --interval 1 --iterations 1 --metric-type summary --num-connections-per-node 100 --percent-cached 0 -r 50 -w 50 --record-selection uniform --store-name test --target-throughput 1000 --threads 16 --url tcp://<ip_addr>:6666 --value-size 4000 --ops-count 99999

I also tried to modify the file config/single_node_cluster/config/server.properties, increasing bdb.cache.size from 1G to 16G.

What I found so far (see attached images) is that, looking at JConsole, each "glitch" corresponds to a sawtooth in the memory usage (limited to 1 GB), even though, running top, you can see I assigned 32 GB of memory to Java. How can I overcome this threshold? Would you mind pointing me toward a solution?

I appreciate a lot your aid. Thanks!

Felix GV

unread,
Mar 9, 2015, 2:37:54 PM3/9/15
to project-...@googlegroups.com
If jconsole shows the peak of a seesaw, that would imply that GC happened at that time, since it freed some memory inside the JVM. Note that as far as the OS is concerned, the memory is already assigned to the JVM, so you will not see a drop of assigned memory in top. GC will definitely cause spikes in p99 (and possibly p95) latency if it is not tuned optimally for your workload...


--
 
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv


Arunachalam

unread,
Mar 9, 2015, 3:53:08 PM3/9/15
to project-...@googlegroups.com
How much memory does this server have? Just increasing the Java heap and letting it spill to the hard disk is going to make it worse than running with a small amount of memory.

Is this memory locked to the Voldemort server process? What machine are you running this on? Make sure you are running on a server box with a server OS. Client OSes are optimized for response time, and server processes, though they run, have really unpredictable performance.

Thanks,
Arun.

Paolo Forte

unread,
Mar 10, 2015, 12:31:07 PM3/10/15
to project-...@googlegroups.com
The machines have 64 GB RAM, two 12-core AMD Opteron processors, a 500 GB hard disk and a 1 Gb network controller. They run Ubuntu 12.04 LTS.

Thanks.

Paolo Forte

unread,
Mar 11, 2015, 9:58:18 AM3/11/15
to project-...@googlegroups.com
Hi, 
I have switched from voldemort/bin/voldemort-prod-server.sh back to voldemort/config/single_node_cluster and noticed some improvements (the glitches are still present, but their size is about half of what it was before), even though I cannot really understand why.
Moreover, I read some documents about the types of garbage collector: I tried -XX:+UseParallelGC, which still shows glitches, and -XX:+UseG1GC, which is faster during deployment but still has glitches.
Eventually I also tried to increase the -XX:MaxNewSize parameter, but still nothing.
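Concretely, the experiments looked roughly like this (a sketch: one variant at a time, and the heap and new-generation sizes are placeholder values, not the exact numbers I used):

```shell
# Sketch of the collector experiments (one at a time, not together);
# sizes are placeholders.
export VOLD_OPTS="-server -Xmx32g -XX:+UseParallelGC"    # still glitches
# export VOLD_OPTS="-server -Xmx32g -XX:+UseG1GC"        # still glitches
# export VOLD_OPTS="-server -Xmx32g -XX:MaxNewSize=4g"   # no change
```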

Do you think I can get rid of these outliers?

Paolo Forte

unread,
Mar 31, 2015, 8:14:59 AM3/31/15
to project-...@googlegroups.com
I noticed that with a read-only workload (-r 100) the glitches are no longer present. I cannot find a solution for the write workload.