Voldemort Connection problems


Miguel Ausó

Dec 29, 2015, 6:49:03 AM
to project-...@googlegroups.com

Hi, I need your opinion about some strange behavior that I am seeing in my Voldemort platform.


Platform structure: 16 Voldemort servers in 4 zones (4 servers per zone). Each Voldemort server is an LXC container, hosted on 4 physical servers. Each LXC has 50 GB, with a 17 GB BDB cache.

Behavior

Normally we have about 280 clients working with the Voldemort cluster. Every client uses the Java Voldemort client with the default parameters. In this situation the platform is stable, with the following status:

Cluster stats

Main connections: 33K (about 2K-2.5K per server)
Operations for all stores: 10K
GC time: 30 ms
Heap memory use: 130 MB
IOPS server: 0.30
CPU: 5%
Wait: 0%
Daemon Thread Control: 121 MB

Then we added 30 more clients with the same configuration, and at that moment we began to have problems in the cluster.
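As a rough sanity check on the figures above, the following sketch projects the connection count once the extra clients join. It assumes that connections are spread evenly across servers and that every counted connection comes from a client; the thread confirms neither, and the function name is purely illustrative.

```python
def project_connections(total_conns, clients, servers, extra_clients):
    """Scale the observed connection count linearly with the number of clients."""
    per_client = total_conns / clients  # connections one client holds in total
    new_total = per_client * (clients + extra_clients)
    return {
        "per_server_now": total_conns / servers,
        "per_client_per_server": per_client / servers,
        "total_after": new_total,
        "per_server_after": new_total / servers,
    }


# Figures from the post: ~33K connections, 280 clients, 16 servers, 30 new clients.
stats = project_connections(33_000, 280, 16, 30)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under those assumptions the load grows by only about 11% (30 extra clients on top of 280), which hints that a fixed limit is being crossed rather than the cluster being gradually overloaded.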

We can see this error:

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: [10:25:04,067 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Completed streaming slop pusher job which started at Tue Dec 29 1...cutor$Worker]

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.lang.Thread.run(Thread.java:745)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: Caused by: voldemort.store.UnreachableStoreException: Failure while checking out socket for bcn1-cache-vold-095p1:6666(vp1):

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.UnreachableStoreException.wrap(UnreachableStoreException.java:41)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutorPool.checkout(ClientRequestExecutorPool.java:214)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.request(SocketStore.java:278)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.get(SocketStore.java:200)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.get(SocketStore.java:62)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.serialized.SerializingStore.get(SerializingStore.java:107)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.client.AbstractStoreClientFactory.getRemoteMetadata(AbstractStoreClientFactory.java:579)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.client.SocketStoreClientFactory.getRemoteMetadata(SocketStoreClientFactory.java:97)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: ... 16 more

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: Caused by: java.net.ConnectException: ClientRequestExecutor timed out for destination bcn1-cache-vold-095p1:6666(vp1)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutorFactory$1.requestComplete(ClientRequestExecutorFactory.java:210)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.invokeCallback(NonblockingStoreCallbackClientRequest.java:68)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.timeOut(NonblockingStoreCallbackClientRequest.java:128)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutor.completeClientRequest(ClientRequestExecutor.java:358)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutor.close(ClientRequestExecutor.java:200)

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutor.checkTimeout(ClientRequestExecutor.java:108)



It seems that we have a connection problem. If we check the JMX graphs, we can see connection problems on some servers.

We can see that when we added the 30 new servers, we began to lose stability on some servers.

On the other hand, we can see strange behavior in Daemon Thread Control.

In the Voldemort logs, we have connection problems at the same time (the Ethernet interfaces of the servers are not busy).

GC time stays the same, at 30 ms.

So we know that we are probably hitting some limit in the Voldemort server or client, but we can't find it.


Finally, these are our Java and server options:

node.id=1

max.threads=20000

############### DB options ######################

http.enable=true
socket.enable=true

# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=17G
bdb.one.env.per.store=true

# Mysql
mysql.host=localhost
mysql.port=1521
mysql.user=root
mysql.password=3306
mysql.database=test

#NIO connector settings.
enable.nio.connector=true

request.format=vp3
storage.configs=voldemort.store.bdb.BdbStorageConfiguration, voldemort.store.readonly.ReadOnlyStorageConfiguration, voldemort.store.memory.CacheStorageConfiguration

Java options


java -Dlog4j.configuration=file:///opt/voldemort/src/java/log4j.properties -server -Xms28g -Xmx28g -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2 -XX:+AlwaysPreTouch -XX:+UseCompressedOops -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=7198 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false voldemort.server.VoldemortServer




Any ideas?


Some extra last-minute info:

None of the clients have a client zone configured, so by default all clients connect to zone 0 (I suppose).


Thanks!

Arunachalam

Dec 30, 2015, 10:42:45 AM
to project-...@googlegroups.com

What is the client and server version?


Miguel Ángel Ausó

Dec 30, 2015, 10:43:51 AM
to project-...@googlegroups.com
Hi

It is 1.9.18

--
You received this message because you are subscribed to a topic in the Google Groups "project-voldemort" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/project-voldemort/rkP7gzLCq74/unsubscribe.
To unsubscribe from this group and all its topics, send an email to project-voldem...@googlegroups.com.
Visit this group at https://groups.google.com/group/project-voldemort.
For more options, visit https://groups.google.com/d/optout.

Arunachalam

Dec 30, 2015, 10:48:58 AM
to project-...@googlegroups.com

Can you increase the number of server selectors to see if it makes a difference?


Miguel Ángel Ausó

Dec 30, 2015, 10:51:15 AM
to project-...@googlegroups.com
What do you mean by server selectors?

Arunachalam

Dec 30, 2015, 11:03:11 AM
to project-...@googlegroups.com

Arunachalam

Dec 30, 2015, 11:04:07 AM
to project-...@googlegroups.com

I am on a tablet, will send you more info when I am back on my macbook.

Miguel Ausó

Dec 31, 2015, 4:32:59 AM
to project-voldemort
Hi @Arunachalam

I found some info about the NIO connectors:
public void setNioConnectorSelectors(int nioConnectorSelectors)

If I understand correctly, this is a client-side value, although it seems that on the server the Voldemort process sets it according to the number of CPUs (on my servers the value is currently 32, because every server has 32 CPUs).

In any case, can you tell me if I'm right?

Thanks

Arunachalam

Jan 1, 2016, 3:37:16 PM
to project-...@googlegroups.com
Miguel, it is hard to tell for the server side. The easiest way is to capture a thread dump and look for nio in the thread names.

There are several different selectors running inside the server.

So there are a couple of things I suspect. In Voldemort, server selectors listen for client requests and process them. An increase in the number of clients means each selector has more requests to process, and back pressure slowly piles up until it breaks everything. Increasing the number of selectors will help alleviate this problem. There is also a JMX counter you can monitor to see how fast the selectors are hitting the select call. It is under voldemort.server.niosocket nio-socket-server; look for selectTimeMS99th.

This is a little tricky to monitor because of the way the select is coded: typically, if there are requests it wakes up immediately, but if there are no requests it wakes up only every 500 ms or so. You need to know how NIO works to monitor this. But if you see high times when the bad things happen, you have too few selectors. Increasing the selectors might also move the load to the IO subsystem, but I hope IO is not the bottleneck.
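The wake-up behavior described above can be reproduced with a small stand-alone sketch. It uses Python's standard selectors module rather than Voldemort's Java NIO code, and the 0.5 s timeout is only an assumed stand-in for the "500 ms or so" mentioned here. It shows why an idle selector reports a select time close to the timeout, while a selector with pending requests returns almost at once.

```python
import selectors
import socket
import time

POLL_TIMEOUT_S = 0.5  # assumed stand-in for the "every 500 ms or so" in the post


def timed_select(sel, timeout=POLL_TIMEOUT_S):
    """Run one select() call and return (ready_count, elapsed_seconds)."""
    start = time.monotonic()
    ready = sel.select(timeout)
    return len(ready), time.monotonic() - start


def demo():
    left, right = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(right, selectors.EVENT_READ)
    try:
        idle = timed_select(sel)  # nothing to read: blocks for the full timeout
        left.sendall(b"x")        # simulate an incoming request
        busy = timed_select(sel)  # data pending: returns almost immediately
        return idle, busy
    finally:
        sel.close()
        left.close()
        right.close()


if __name__ == "__main__":
    (idle_ready, idle_s), (busy_ready, busy_s) = demo()
    print(f"idle: ready={idle_ready} elapsed={idle_s * 1000:.0f} ms")
    print(f"busy: ready={busy_ready} elapsed={busy_s * 1000:.0f} ms")
```

Read this way, a high select time on a quiet server is expected; it is a high value while requests are flowing that would point to too few selectors.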

This is how you can see how many selectors you have on the server side.

[athirupa-mn1:voldemort]$ jps -vvv
9459 Jps -Dapplication.home=/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home -Xms8m
9428 VoldemortServer -Dlog4j.configuration=file:///Users/athirupa/projects/voldemort/src/java/log4j.properties -Xmx2G -Dcom.sun.management.jmxremote

[athirupa-mn1:voldemort]$ jstack -l 9428 | grep 'voldemort-nio-socket-server-t'
"voldemort-nio-socket-server-t8" #52 daemon prio=5 os_prio=31 tid=0x00007fb963977000 nid=0x9f03 runnable [0x0000000128438000]
"voldemort-nio-socket-server-t7" #51 daemon prio=5 os_prio=31 tid=0x00007fb963a02000 nid=0x9d03 runnable [0x0000000128335000]
"voldemort-nio-socket-server-t6" #50 daemon prio=5 os_prio=31 tid=0x00007fb963483000 nid=0x9b03 runnable [0x0000000128232000]
"voldemort-nio-socket-server-t5" #49 daemon prio=5 os_prio=31 tid=0x00007fb96348c800 nid=0x9903 runnable [0x000000012812f000]
"voldemort-nio-socket-server-t4" #48 daemon prio=5 os_prio=31 tid=0x00007fb963493800 nid=0x9703 runnable [0x000000012802c000]
"voldemort-nio-socket-server-t3" #47 daemon prio=5 os_prio=31 tid=0x00007fb963493000 nid=0x9503 runnable [0x0000000127e0b000]
"voldemort-nio-socket-server-t2" #46 daemon prio=5 os_prio=31 tid=0x00007fb9639ff000 nid=0x9303 runnable [0x0000000127d08000]
"voldemort-nio-socket-server-t1" #45 daemon prio=5 os_prio=31 tid=0x00007fb962bde000 nid=0x9103 runnable [0x0000000127c05000]


I have 8 selectors on my mac notebook :)

Thanks,
Arun.

Miguel Ausó

Jan 2, 2016, 3:48:46 PM
to project-voldemort
Hi Arun

First of all, thanks for your reply.

Now the situation: all servers (old and new) have been running for 2 days.
If you look at the image below, you can see that I have many connection errors due to timeouts. Also, connections are not evenly distributed across the servers; some have more than others.


On the other hand, I ran the jps command on the servers and got this result:

[root@bcn1-cache-vold-095p1:mauso]# /usr/java/jdk1.7.0_72/bin/jps -vvv
12958 VoldemortServer -Dlog4j.configuration=file:///opt/voldemort/src/java/log4j.properties -Xms28g -Xmx28g -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:SurvivorRatio=2 -XX:+AlwaysPreTouch -XX:+UseCompressedOops -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=7198 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
31144 Jps -Dapplication.home=/usr/java/jdk1.7.0_72 -Xms8m
[root@bcn1-cache-vold-095p1:mauso]# /usr/java/jdk1.7.0_72/bin/jstack -l 12958 | grep  'voldemort-nio-socket-server-t'
"voldemort-nio-socket-server-t32" daemon prio=10 tid=0x00007f0351230800 nid=0x3334 runnable [0x00007f00d3cfb000]
"voldemort-nio-socket-server-t31" daemon prio=10 tid=0x00007f0351216000 nid=0x3333 runnable [0x00007f00d3dfc000]
"voldemort-nio-socket-server-t30" daemon prio=10 tid=0x00007f03511fa800 nid=0x3332 runnable [0x00007f00d3efd000]
"voldemort-nio-socket-server-t29" daemon prio=10 tid=0x00007f03511df800 nid=0x3331 runnable [0x00007f00d3ffe000]
"voldemort-nio-socket-server-t28" daemon prio=10 tid=0x00007f03511c4800 nid=0x3330 runnable [0x00007f01d41c0000]
"voldemort-nio-socket-server-t27" daemon prio=10 tid=0x00007f03511a9800 nid=0x332f runnable [0x00007f01d42c1000]
"voldemort-nio-socket-server-t26" daemon prio=10 tid=0x00007f035118f000 nid=0x332e runnable [0x00007f01d43c2000]
"voldemort-nio-socket-server-t25" daemon prio=10 tid=0x00007f0351174000 nid=0x332d runnable [0x00007f01d44c3000]
"voldemort-nio-socket-server-t24" daemon prio=10 tid=0x00007f0351159000 nid=0x332c runnable [0x00007f01d45c4000]
"voldemort-nio-socket-server-t23" daemon prio=10 tid=0x00007f035113d800 nid=0x332b runnable [0x00007f01d46c5000]
"voldemort-nio-socket-server-t22" daemon prio=10 tid=0x00007f0351122800 nid=0x332a runnable [0x00007f01d47c6000]
"voldemort-nio-socket-server-t21" daemon prio=10 tid=0x00007f0351107800 nid=0x3329 runnable [0x00007f01d48c7000]
"voldemort-nio-socket-server-t20" daemon prio=10 tid=0x00007f03510ec800 nid=0x3328 runnable [0x00007f01d49c8000]
"voldemort-nio-socket-server-t19" daemon prio=10 tid=0x00007f03510d1800 nid=0x3327 runnable [0x00007f01d4ac9000]
"voldemort-nio-socket-server-t18" daemon prio=10 tid=0x00007f03510b6800 nid=0x3326 runnable [0x00007f01d4bca000]
"voldemort-nio-socket-server-t17" daemon prio=10 tid=0x00007f035109b800 nid=0x3325 runnable [0x00007f01d4ccb000]
"voldemort-nio-socket-server-t16" daemon prio=10 tid=0x00007f0351080000 nid=0x3324 runnable [0x00007f01d4dcc000]
"voldemort-nio-socket-server-t15" daemon prio=10 tid=0x00007f0351065000 nid=0x3323 runnable [0x00007f01d4ecd000]
"voldemort-nio-socket-server-t14" daemon prio=10 tid=0x00007f035104a000 nid=0x3322 runnable [0x00007f01d4fce000]
"voldemort-nio-socket-server-t13" daemon prio=10 tid=0x00007f035102f000 nid=0x3321 runnable [0x00007f01d50cf000]
"voldemort-nio-socket-server-t12" daemon prio=10 tid=0x00007f0351014000 nid=0x3320 runnable [0x00007f01d51d0000]
"voldemort-nio-socket-server-t11" daemon prio=10 tid=0x00007f0350ff9000 nid=0x331f runnable [0x00007f01d52d1000]
"voldemort-nio-socket-server-t10" daemon prio=10 tid=0x00007f0350fdd800 nid=0x331e runnable [0x00007f01d53d2000]
"voldemort-nio-socket-server-t9" daemon prio=10 tid=0x00007f0350fc2800 nid=0x331d runnable [0x00007f01d54d3000]
"voldemort-nio-socket-server-t8" daemon prio=10 tid=0x00007f0350fa7800 nid=0x331c runnable [0x00007f01d55d4000]
"voldemort-nio-socket-server-t7" daemon prio=10 tid=0x00007f0350f8c800 nid=0x331b runnable [0x00007f01d56d5000]
"voldemort-nio-socket-server-t6" daemon prio=10 tid=0x00007f0350f71800 nid=0x331a runnable [0x00007f01d57d6000]
"voldemort-nio-socket-server-t5" daemon prio=10 tid=0x00007f0350f56800 nid=0x3319 runnable [0x00007f01d58d7000]
"voldemort-nio-socket-server-t4" daemon prio=10 tid=0x00007f0350ebb000 nid=0x3318 runnable [0x00007f01d59d8000]
"voldemort-nio-socket-server-t3" daemon prio=10 tid=0x00007f0350eb9800 nid=0x3317 runnable [0x00007f01d5ad9000]
"voldemort-nio-socket-server-t2" daemon prio=10 tid=0x00007f0350ebd800 nid=0x3316 runnable [0x00007f01d5bda000]
"voldemort-nio-socket-server-t1" daemon prio=10 tid=0x00007f0350ebc800 nid=0x3315 runnable [0x00007f01d5cdb000]


32 selectors, one for each CPU.
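Counting the selector threads by eye gets tedious with 32 of them. A minimal sketch that does the counting from a saved thread dump could look like this; the thread-name prefix is taken from the jstack output above, while the function name and the non-selector line in the sample are made up for illustration.

```python
import re

# Thread-name prefix as it appears in the jstack output quoted in this thread.
SELECTOR_THREAD = re.compile(r'^"voldemort-nio-socket-server-t(\d+)"')


def count_selector_threads(jstack_output):
    """Return how many distinct NIO selector threads appear in a jstack dump."""
    ids = set()
    for line in jstack_output.splitlines():
        match = SELECTOR_THREAD.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return len(ids)


# Two selector lines copied from the dump above, plus one hypothetical
# non-selector thread line that must not be counted.
sample = '''\
"voldemort-nio-socket-server-t2" daemon prio=10 tid=0x00007f0350ebd800 nid=0x3316 runnable [0x00007f01d5bda000]
"voldemort-nio-socket-server-t1" daemon prio=10 tid=0x00007f0350ebc800 nid=0x3315 runnable [0x00007f01d5cdb000]
"some-other-thread" daemon prio=10 runnable
'''
print(count_selector_threads(sample))
```

In practice the input would be the text produced by jstack -l for the VoldemortServer process ID reported by jps.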

Finally, I started to graph the NIO time.


No server goes above 500.

So I can try to increase the number of NIO selectors. How can I do it? What value do you think I should use?

If you have more ideas or need more information, please let me know.

Thanks!!

Miguel Ausó

Jan 7, 2016, 5:01:53 AM
to project-voldemort
Hi, 

I checked the client side and I can see this error:

2016-01-06 00:52:06,079 ERROR [AsynchronousVoldemortDistributedCache: :ED67E0848842B96ACC5B13CBE2807017] [VoldemortDistributedCache] - Exception adding entry in cache: voldemort.versioning.ObsoleteVersionException: Key 00ec332e322e322e39
2e342e312e332e312e312e312e312e322e322e332e312e312e312e312e312e312e312e312e332e332e342e322e312e312e312e322e392e342e312e332e312e312e312e312e322e322e332e312e312e312e312e312e312e312e312e332e332e342e322e312e312e312d3141534b3132303030313437383
53633323030303030393734333234373946323230465354414e444152445f52414e47455f444154453131343739363636363030303030323437393937343346323230465354414e444152445f52414e47455f44415445464c49474854464744454641554c544e4f524d414c5b5d5b48424f42415d5b43
4f52504f524154455f554e4946415245532c20454c454354524f4e49435f5449434b45545f4f4e4c592c204e4f5f4c43435f46415245532c2050415353454e4745525f53414d455f424f4f4b494e475f434f44452c205055424c49534845445f46415245532c205449434b45545f4142494c4954595f4
34845434b2c20554e4946415245535d66616c73657472756530304742 version(30:4) ts:1452037926078 is obsolete, it is no greater than the current version of version(30:4) ts:1452037926077.

Also, on the Voldemort servers it seems that the error always involves the same server, bcn1-cache-vold-095p1:

Jan 07 10:45:14 bcn1-cache-vold-095p1 voldemort-server.sh[588]: voldemort.VoldemortException: voldemort.store.UnreachableStoreException: Failure while checking out socket for bcn1-cache-vold-095p1:6666(vp1):

Jan 07 10:45:14 bcn1-cache-vold-095p1 voldemort-server.sh[588]: Caused by: voldemort.store.UnreachableStoreException: Failure while checking out socket for bcn1-cache-vold-095p1:6666(vp1):

I don't know whether this server has some problem (I checked all the configurations; I installed the servers with Puppet, so they should all be alike), or whether it is because this server is the first one in the cluster.xml file.

Finally, I'm not sure about this, but it seems that the error appears regularly, every 5 or 10 minutes.

Could it be a background task?

Any ideas?

Thanks!

Arunachalam

Jan 7, 2016, 1:51:54 PM
to project-...@googlegroups.com
Sorry Miguel, after the new year things are a little crazy at my work :). I will try to spend some time on it this weekend.

Thanks,
Arun.


Miguel Ángel Ausó

Jan 7, 2016, 1:54:07 PM
to project-...@googlegroups.com
No problem, thank you for your interest and support


Miguel Ausó

Jan 9, 2016, 4:53:38 AM
to project-voldemort
Hi All

To give more info, I noticed that most of the connection errors appear together with the slop process.

Could it be that the slop process blocks client access?


ERROR

Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: [10:25:04,067 voldemort.server.scheduler.slop.StreamingSlopPusherJob] INFO Completed streaming slop pusher job which started at Tue Dec 29 1...cutor$Worker]
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at java.lang.Thread.run(Thread.java:745)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: Caused by: voldemort.store.UnreachableStoreException: Failure while checking out socket for bcn1-cache-vold-095p1:6666(vp1):
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.UnreachableStoreException.wrap(UnreachableStoreException.java:41)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutorPool.checkout(ClientRequestExecutorPool.java:214)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.request(SocketStore.java:278)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.get(SocketStore.java:200)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.SocketStore.get(SocketStore.java:62)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.serialized.SerializingStore.get(SerializingStore.java:107)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.client.AbstractStoreClientFactory.getRemoteMetadata(AbstractStoreClientFactory.java:579)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.client.SocketStoreClientFactory.getRemoteMetadata(SocketStoreClientFactory.java:97)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: ... 16 more
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: Caused by: java.net.ConnectException: ClientRequestExecutor timed out for destination bcn1-cache-vold-095p1:6666(vp1)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutorFactory$1.requestComplete(ClientRequestExecutorFactory.java:210)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.invokeCallback(NonblockingStoreCallbackClientRequest.java:68)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.timeOut(NonblockingStoreCallbackClientRequest.java:128)
Dec 29 10:25:04 bcn1-cache-vold-095p2.  voldemort-server.sh[656]: at voldemort.store.socket.clientrequest.ClientRequestExecutor.completeClientReque


# The ID of *this* particular cluster node

max.threads=20000

http.enable=true
socket.enable=true
# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=17G
bdb.one.env.per.store=true

#NIO connector settings.
enable.nio.connector=true

request.format=vp3
storage.configs=voldemort.store.bdb.BdbStorageConfiguration, voldemort.store.readonly.ReadOnlyStorageConfiguration, voldemort.store.memory.CacheStorageConfiguration

Thanks!


Arunachalam

Jan 9, 2016, 5:10:24 PM
to project-...@googlegroups.com
Slop is heavy on BDB activity and slop should not happen when all the servers are healthy. Slop only happens when there is a node failure, but it could affect your BDB response times, which can cause a request backlog and other issues as you mentioned.

Thanks,
Arun.



Miguel Ausó

Jan 9, 2016, 6:30:33 PM
to project-voldemort
Hi Arun,

How can I detect which node is not healthy? (I have 16 servers.)

Today I had 301 slops on each server. The process always fails on server 095p1; maybe it is because that server is node 0, I don't know. I haven't found anything peculiar about this node compared with the others.

I have found that the switches show some discards, perhaps due to the increased connections (the 30 new servers).

Can connection errors generate slops? I have the hinted-handoff-strategy option set on all tables; could that be the reason?

Finally, do you recommend adding any option to the server.properties file?

I know these are a lot of questions, but I have been dealing with this problem for two weeks.

Thank you!

Arunachalam

Jan 9, 2016, 8:08:32 PM
to project-...@googlegroups.com
What is your client version? Clients before a specific version always bootstrap and query for metadata updates on node 0, which causes it to receive increased traffic, but that was fixed.


Are you sure your clients are running version 1.9.18?

Thanks,
Arun.




Miguel Ausó

Jan 10, 2016, 4:29:45 AM
to project-voldemort
Hi Arun , 

Thanks for your reply.

I'm not sure what the client version is (I will ask the developers tomorrow). Even so, the slop process uses the client version on the server, and I'm sure that version is 1.9.18.

When the clients request the metadata, which port do they use? The admin port? Maybe I need to increase the admin connections.

Arunachalam

Jan 10, 2016, 12:46:00 PM
to project-...@googlegroups.com

The client uses the client port. Also, clients should preferably be on 1.10.2+, as we fixed some more connection issues in that version.


Miguel Ausó

Jan 12, 2016, 11:43:09 AM
to project-voldemort
Hi Arun

Well, I finally found the problem. The Voldemort cluster is running on LXC: I have 4 servers and each server has 4 LXC containers.

The new servers filled the ARP table on the Voldemort servers, and that caused the connection problems.

The solution was to increase these kernel values (raising the ARP limit):

sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
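To spot this condition before it causes timeouts, the number of neighbour entries can be compared with the hard limit gc_thresh3 (commonly 1024 by default on Linux, with gc_thresh1 and gc_thresh2 at 128 and 512). The sketch below is one possible check, not something taken from this thread: it assumes a Linux host exposing /proc/net/arp and /proc/sys/net/ipv4/neigh/default/gc_thresh3, and the function names are illustrative.

```python
from pathlib import Path

ARP_TABLE = Path("/proc/net/arp")
GC_THRESH3 = Path("/proc/sys/net/ipv4/neigh/default/gc_thresh3")


def count_arp_entries(arp_text):
    """Count entries in /proc/net/arp-style text (the first line is the header)."""
    lines = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def arp_usage(entries, gc_thresh3):
    """Return the fraction of the hard neighbour-table limit currently in use."""
    return entries / gc_thresh3


# Only touch the real files when run as a script on a host that exposes them.
if __name__ == "__main__" and ARP_TABLE.exists() and GC_THRESH3.exists():
    entries = count_arp_entries(ARP_TABLE.read_text())
    limit = int(GC_THRESH3.read_text())
    usage = arp_usage(entries, limit)
    print(f"{entries} IPv4 neighbour entries, {usage:.0%} of gc_thresh3={limit}")
```

A usage figure that approaches 100% would be a warning sign; a full table is usually also reported in the kernel log as a neighbour table overflow. Note that sysctl -w does not survive a reboot, so the new values also need to be stored in the system's sysctl configuration.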

Thank you for your support.

Arunachalam

Jan 12, 2016, 1:32:23 PM
to project-...@googlegroups.com
Thanks Miguel for the update. That is interesting.

I would also recommend running the 1.10.2+ clients, as the older clients have non-graceful failures, which make things spiral out of control.

Thanks,
Arun.


Miguel Ausó

Jan 13, 2016, 4:58:11 AM
to project-voldemort
Hi Arun

We will upgrade the client. Do you know if there are client release notes?


These are the server release notes

...

Félix GV

Jan 13, 2016, 9:21:59 AM
to project-voldemort
The release notes encompass changes to server, client, scripts and auxiliary processes (i.e.: Build and Push).

-F