High client CPU usage and java.lang.NoClassDefFoundError

G Qi

Jan 8, 2018, 11:25:48 AM
to project-voldemort
Hello there.

We run a 4-node Voldemort server cluster on voldemort-release-1.10.14.

A SolrCloud cluster with five nodes connects to the Voldemort servers.
Each SolrCloud node has two Voldemort clients, one created for each Solr core. In total we have 10 Voldemort clients across 5 Solr nodes.

I have updated the Voldemort client jar to the latest version, 1.10.26.

2018-01-05 11:34:49.842 INFO  (coreLoadExecutor-10-thread-2-processing-n:wp-np2-c0:8983_solr) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.c.s.ClientRegistryRefresher Initial version obtained from client registry: version(0:1, 1:1, 2:3, 3:4) ts:1515152089828
2018-01-05 11:34:49.847 INFO  (coreLoadExecutor-10-thread-2-processing-n:wp-np2-c0:8983_solr) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.c.ZenStoreClient Client registry refresher thread started, refresh interval: 43200 seconds
2018-01-05 11:34:49.847 INFO  (coreLoadExecutor-10-thread-2-processing-n:wp-np2-c0:8983_solr) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.c.ZenStoreClient Voldemort client created: .avro-uniprot@wp-np2-c0:/nfs/public/rw/homes/uni_adm/solrcloud/dist/solr-7.1.0/server

bootstrapTime=1515152089702
context=
deploymentPath=/nfs/public/rw/homes/uni_adm/solrcloud/dist/solr-7.1.0/server
localHostName=wp-np2-c0
sequence=0
storeName=avro-uniprot
updateTime=1515152089383
releaseVersion=null
clusterMetadataVersion=0
bootstrap_urls=[tcp://ves-oy-ea:6666]
max_connections=20
connection_timeout_ms=60000
socket_timeout_ms=60000
routing_timeout_ms=60000
client_zone_id=-1
failuredetector_implementation=voldemort.cluster.failuredetector.ThresholdFailureDetector
failuredetector_threshold=95
failuredetector_threshold_count_minimum=30
failuredetector_threshold_interval=300000
failuredetector_threshold_async_recovery_interval=10000
fetch_all_stores_xml_in_bootstrap=true
idle_connection_timeout_minutes=-1

Every 12 hours, some clients try to refresh their connections:
2018-01-05 23:34:49.848 INFO  (voldemort-scheduler-service1-t2) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.c.s.ClientRegistryRefresher updating client registry with the following info for client: .avro-uniprot@wp-np2-c0:/nfs/public/rw/homes/uni_adm/solrcloud/dist/solr-7.1.0/server

We then get the following errors for most of the Voldemort nodes:

2018-01-05 23:34:49.850 INFO  (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutor IOException from Destination: ves-oy-ea:6666(vp1) , Socket: Socket[addr=ves-oy-ea/10.3.7.234,port=6666,localport=45426] with message - Connection reset by peer
2018-01-05 23:34:50.263 ERROR (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutorFactory$ClientRequestSelectorManager null
java.lang.ExceptionInInitializerError
        at voldemort.serialization.VSlopProto$Slop.<clinit>(VSlopProto.java:495)
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:41)
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:35)
        at voldemort.store.slop.HintedHandoff.sendHintParallel(HintedHandoff.java:113)
        at voldemort.store.routed.action.PerformParallelPutRequests$1.requestComplete(PerformParallelPutRequests.java:191)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.invokeCallback(NonblockingStoreCallbackClientRequest.java:68)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.complete(NonblockingStoreCallbackClientRequest.java:87)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.completeClientRequest(ClientRequestExecutor.java:430)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.close(ClientRequestExecutor.java:250)
        at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:125)
        at voldemort.common.nio.AbstractSelectorManager.run(AbstractSelectorManager.java:243)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Generated message class "voldemort.serialization.VSlopProto$Slop" missing method "getStoreBytes".
        at com.google.protobuf.GeneratedMessage.getMethodOrDie(GeneratedMessage.java:1971)
        at com.google.protobuf.GeneratedMessage.access$1100(GeneratedMessage.java:61)
        at com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularStringFieldAccessor.<init>(GeneratedMessage.java:2860)
        at com.google.protobuf.GeneratedMessage$FieldAccessorTable.ensureFieldAccessorsInitialized(GeneratedMessage.java:2108)
        at com.google.protobuf.GeneratedMessage$FieldAccessorTable.<init>(GeneratedMessage.java:2039)
        at voldemort.serialization.VSlopProto$1.assignDescriptors(VSlopProto.java:531)
        at com.google.protobuf.Descriptors$FileDescriptor.internalBuildGeneratedFileFrom(Descriptors.java:355)
        at voldemort.serialization.VSlopProto.<clinit>(VSlopProto.java:539)
        ... 14 more
Caused by: java.lang.NoSuchMethodException: voldemort.serialization.VSlopProto$Slop.getStoreBytes()
        at java.lang.Class.getMethod(Class.java:1786)
        at com.google.protobuf.GeneratedMessage.getMethodOrDie(GeneratedMessage.java:1968)
        ... 21 more

2018-01-05 23:34:50.266 INFO  (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutor IOException from Destination: ves-oy-ec.ebi.ac.uk:6666(vp1) , Socket: Socket[addr=ves-oy-ec.ebi.ac.uk/10.3.7.236,port=6666,localport=41358] with message - Connection reset by peer
2018-01-05 23:34:50.266 ERROR (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutorFactory$ClientRequestSelectorManager Could not initialize class voldemort.serialization.VSlopProto$Slop
java.lang.NoClassDefFoundError: Could not initialize class voldemort.serialization.VSlopProto$Slop
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:41)
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:35)
        at voldemort.store.slop.HintedHandoff.sendHintParallel(HintedHandoff.java:113)
        at voldemort.store.routed.action.PerformParallelPutRequests$1.requestComplete(PerformParallelPutRequests.java:191)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.invokeCallback(NonblockingStoreCallbackClientRequest.java:68)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.complete(NonblockingStoreCallbackClientRequest.java:87)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.completeClientRequest(ClientRequestExecutor.java:430)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.close(ClientRequestExecutor.java:250)
        at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:125)
        at voldemort.common.nio.AbstractSelectorManager.run(AbstractSelectorManager.java:243)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

2018-01-05 23:34:50.267 INFO  (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutor IOException from Destination: ves-oy-eb.ebi.ac.uk:6666(vp1) , Socket: Socket[addr=ves-oy-eb.ebi.ac.uk/10.3.7.235,port=6666,localport=52260] with message - Connection reset by peer
2018-01-05 23:34:50.267 ERROR (voldemort-niosocket-client-system-t1) [c:uniprot s:shard2 r:core_node7 x:uniprot_shard2_replica_n4] v.s.s.c.ClientRequestExecutorFactory$ClientRequestSelectorManager Could not initialize class voldemort.serialization.VSlopProto$Slop
java.lang.NoClassDefFoundError: Could not initialize class voldemort.serialization.VSlopProto$Slop
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:41)
        at voldemort.serialization.SlopSerializer.toBytes(SlopSerializer.java:35)
        at voldemort.store.slop.HintedHandoff.sendHintParallel(HintedHandoff.java:113)
        at voldemort.store.routed.action.PerformParallelPutRequests$1.requestComplete(PerformParallelPutRequests.java:191)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.invokeCallback(NonblockingStoreCallbackClientRequest.java:68)
        at voldemort.store.socket.clientrequest.NonblockingStoreCallbackClientRequest.complete(NonblockingStoreCallbackClientRequest.java:87)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.completeClientRequest(ClientRequestExecutor.java:430)
        at voldemort.store.socket.clientrequest.ClientRequestExecutor.close(ClientRequestExecutor.java:250)
        at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:125)
        at voldemort.common.nio.AbstractSelectorManager.run(AbstractSelectorManager.java:243)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
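
The two error shapes in the traces above (one ExceptionInInitializerError, then NoClassDefFoundError: Could not initialize class on every later attempt) match standard JVM behaviour when a static initializer fails. A minimal sketch of that behaviour follows; the class names InitFailureDemo and Broken are hypothetical stand-ins, not Voldemort code.

```java
public class InitFailureDemo {

    // Hypothetical stand-in for a class whose static initializer fails,
    // as VSlopProto$Slop does in the traces above.
    static class Broken {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final int VALUE = compute();

        static int compute() {
            throw new RuntimeException("simulated: generated class missing a method");
        }
    }

    // Touches Broken and reports the type of Throwable that comes back, if any.
    static String attempt() {
        try {
            return String.valueOf(Broken.VALUE);
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(attempt()); // first use: the initializer runs and fails
        System.out.println(attempt()); // later uses: the class is marked unusable
    }
}
```

Run on its own, the first line printed names java.lang.ExceptionInInitializerError and the second names java.lang.NoClassDefFoundError. So the NoClassDefFoundError here does not mean the class is absent from the classpath; only the first trace carries the root cause.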

After this point, the Voldemort clients on the SolrCloud nodes cause high CPU usage on the VMs.

From the JVM flight recording, we can see several voldemort-niosocket-client-system-t1 threads running. Here is the whole trace:

Stack Trace
voldemort-niosocket-client-system-t1 [79] (RUNNABLE)
   java.lang.Throwable.fillInStackTrace line: not available [native method]
   java.lang.Throwable.fillInStackTrace line: 783 
   java.lang.Throwable.<init> line: 265 
   java.lang.Exception.<init> line: 66 
   java.io.IOException.<init> line: 58 
   java.io.EOFException.<init> line: 62 
   voldemort.store.socket.clientrequest.ClientRequestExecutor.read line: 262 
   voldemort.common.nio.SelectorManagerWorker.run line: 105 
   voldemort.common.nio.AbstractSelectorManager.run line: 243 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1142 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 617 
   java.lang.Thread.run line: 745 


To me, it looks like there is an infinite loop running at voldemort.common.nio.AbstractSelectorManager.run line: 243.

Could somebody help me look into this problem?

Thanks.

Felix GV

Jan 8, 2018, 1:41:05 PM
to G Qi, project-voldemort
Not sure about the infinite loop, but Protobuf has been notorious in the past in terms of causing backwards incompatibility issues.

Voldemort depends on a certain version of PB for its normal operations, and providing it with a more recent version can cause issues.

For this purpose, there is a target in the gradle build which packages PB with a rewritten package name into a fat jar. This should help alleviate those types of NoClassDefFoundError issues. I suspect this classpath problem may exacerbate some sort of latent bug in a part of the code that is not expecting to fail in this way, which might perhaps in turn cause the infinite loop and increased CPU usage you're observing.
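
One way such a latent bug could produce a busy loop, sketched with plain java.nio and no Voldemort code: if the cleanup path throws before the channel is actually closed, a level-triggered selector keeps reporting the half-closed socket as readable and every read returns EOF again. Whether Voldemort's ClientRequestExecutor behaves this way is an assumption, not something confirmed in this thread; EofSpinDemo, cleanup and callbackFails are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class EofSpinDemo {

    // Hypothetical close path: a completion callback runs first and may fail
    // (as a broken class would) before the channel is closed.
    static void cleanup(SocketChannel channel, boolean callbackFails) throws IOException {
        if (callbackFails) {
            throw new NoClassDefFoundError("simulated: Could not initialize class");
        }
        channel.close(); // closing the channel also cancels its selection key
    }

    // Counts how many times EOF is read from one hung-up connection,
    // bounded by maxIterations so the demo always terminates.
    static int countEofReads(boolean callbackFails, int maxIterations) {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open();
             SocketChannel client = SocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            client.connect(server.getLocalAddress());
            server.accept().close(); // the peer hangs up immediately
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);

            ByteBuffer buffer = ByteBuffer.allocate(64);
            int eofReads = 0;
            for (int i = 0; i < maxIterations && client.isOpen(); i++) {
                if (selector.select(1000) == 0) {
                    break; // nothing became readable within the timeout
                }
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    keys.next();
                    keys.remove();
                    buffer.clear();
                    if (client.read(buffer) == -1) {
                        eofReads++;
                        try {
                            cleanup(client, callbackFails);
                        } catch (LinkageError e) {
                            // Error is swallowed (a real loop would log it);
                            // the channel stays open and registered.
                        }
                    }
                }
            }
            return eofReads;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("clean close:    " + countEofReads(false, 50));
        System.out.println("failed cleanup: " + countEofReads(true, 50));
    }
}
```

With a clean close the EOF is seen once. With the failing cleanup it is seen on every iteration up to the bound, which in an unbounded selector loop would show up as a thread that is permanently RUNNABLE and constantly constructing EOFException.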

--
Felix GV
Staff Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv


--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-voldem...@googlegroups.com.
Visit this group at https://groups.google.com/group/project-voldemort.
For more options, visit https://groups.google.com/d/optout.

Arunachalam

Jan 8, 2018, 1:42:41 PM
to project-...@googlegroups.com
From the call stack, I am guessing two things.

1) Voldemort implicitly depends on Protobuf for serializing writes when some nodes are unavailable. It looks like there is a Protobuf version conflict. This is a little puzzling, because Voldemort shades Protobuf, so it should not happen. The conflict seems to cause the slops to be retried on lots of nodes, pegging your CPU.
2) Is there a firewall between the Voldemort client and server? A firewall could silently drop the connection without the Voldemort client being aware of it. If so, set idle_connection_timeout to around 10-15 minutes (the usual firewall timeout is 30 minutes) so that Voldemort kills the connection and has the right view of it.
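
A minimal sketch of the setting from point 2, built as plain java.util.Properties. The keys bootstrap_urls and idle_connection_timeout_minutes are copied from the client configuration dump earlier in this thread; that Voldemort's ClientConfig accepts them in this form is an assumption here, and ClientProps and build are hypothetical names.

```java
import java.util.Properties;

public class ClientProps {

    // Builds client properties with an explicit idle connection timeout.
    // Key names are taken from the configuration dump in the original post.
    static Properties build(String bootstrapUrl, int idleTimeoutMinutes) {
        if (idleTimeoutMinutes <= 0) {
            throw new IllegalArgumentException("idle timeout must be positive");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap_urls", bootstrapUrl);
        props.setProperty("idle_connection_timeout_minutes",
                          Integer.toString(idleTimeoutMinutes));
        return props;
    }

    public static void main(String[] args) {
        // 15 minutes stays below the 30-minute firewall timeout mentioned above.
        Properties props = build("tcp://ves-oy-ea:6666", 15);
        // Assumption: these would then be handed to the Voldemort client
        // configuration, for example through a Properties-based constructor.
        props.list(System.out);
    }
}
```

The dump in the original post shows idle_connection_timeout_minutes=-1, which suggests the timeout was not set for the clients in question.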

Thanks,
Arun.


G Qi

Jan 8, 2018, 2:23:38 PM
to project-...@googlegroups.com, arunac...@gmail.com, fvil...@linkedin.com
All the nodes are in our internal network, and I don’t think there is any firewall between them.

If there is a protobuf version conflict, I don’t understand why some clients throw this exception and cause high CPU while others don’t.

Even after these exceptions, everything still works normally and we can still get data from Voldemort. The CPU usage stays at the same level for days.

Before these tests, our production setup used Solr rather than SolrCloud. We only had two Voldemort clients on the Solr nodes, and this symptom never occurred.

I will definitely try to rebuild the jar with the rewritten PB. Felix, can you please let me know the name of the Gradle target?

Thank you very much!

Guoying


Felix GV

Jan 8, 2018, 2:27:52 PM
to G Qi, project-...@googlegroups.com, arunac...@gmail.com
Hi G,

Classpath conflicts are expected to be inconsistent, because the JVM randomly picks up one jar each time. If you get lucky, you get the right one, and things work fine.
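
To see which jar actually won on a given node, the JVM can be asked where a class was loaded from. The sketch below uses only standard reflection calls. The two default class names are taken from the stack traces in this thread, while ClassOrigin and locate are hypothetical names. It needs to run inside the affected application (same class loader) to be meaningful.

```java
import java.security.CodeSource;

public class ClassOrigin {

    // Returns the jar or directory a loaded class came from.
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)"
                              : String.valueOf(source.getLocation());
    }

    // Looks a class up by name without running its static initializer,
    // so a class that fails to initialize can still be located.
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className, false,
                                          ClassOrigin.class.getClassLoader());
            return locate(type);
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        } catch (LinkageError e) {
            return "(failed to link: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args : new String[] {
            "com.google.protobuf.GeneratedMessage",
            "voldemort.serialization.VSlopProto"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If com.google.protobuf.GeneratedMessage resolves to a jar other than the one Voldemort was built against, that would be consistent with the missing getStoreBytes method reported in the first trace.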

For the task, just look at the Gradle build: https://github.com/voldemort/voldemort/blob/master/build.gradle#L241
Good luck.
