Cassandra node often goes down


pava...@way2online.co.in

Oct 16, 2017, 9:09:41 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi,

I had 2 data centers, each containing one node.

My cluster looked like this:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  103.18.248.30  3.07 GiB   256          ?                 c1a2bf5d-fc63-4e77-bb4d-eea406dc54cd  RACK1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  103.18.248.31  2.28 MiB   256          ?                 604ce8a5-291e-4a8a-98ec-0cb460803f8f  RACK1

I have almost 15 keyspaces, but only three of them contain data, and that data is stored only on the node in DC1.

So I changed the configuration so that both nodes are in the same data center on different racks, to distribute the data between the two nodes. For that, I decommissioned the .31 node and re-added it to DC1.

I then changed the replication of all the keyspaces like this:

ALTER KEYSPACE input_data_profile WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 1};

So now my cluster looks like this:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  103.18.248.30  3.07 GiB   256          49.2%             c1a2bf5d-fc63-4e77-bb4d-eea406dc54cd  RACK1
UN  103.18.248.31  2.28 MiB   256          50.8%             604ce8a5-291e-4a8a-98ec-0cb460803f8f  RACK2

Then I ran nodetool repair on both nodes.

But the data is not distributed between the nodes. Owns shows an almost equal distribution, but Load shows that only the .30 server contains data and the .31 server does not.

The .30 server often goes down, and it takes almost 30-40 minutes to bring it back up. I think having all the data on a single node is the reason. I am very frustrated with this issue, so please, can anybody help me fix it?

Thanks and Regards
pavs




Mohamed Amghari

Oct 17, 2017, 5:15:44 AM
to java-dri...@lists.datastax.com

Hi,
You misunderstand how NetworkTopologyStrategy works. You should set the replication factor to 2 to replicate the data on both nodes.

ALTER KEYSPACE input_data_profile WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : <replication factor >};

Hope this helps.
Cheers
Mohamed Amghari
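
Note for readers with many keyspaces (the original post mentions about 15): the statement above has to be issued once per keyspace. A minimal Python sketch that only builds the CQL text follows; it does not connect to a cluster. Only input_data_profile comes from this thread. The other keyspace names and the function name are hypothetical, and the value 2 is the replication factor suggested above. After raising the replication factor, a repair is generally needed so that the additional replica actually receives the existing data.

```python
def alter_keyspace_statements(keyspaces, datacenter, replication_factor):
    """Build one ALTER KEYSPACE statement per keyspace name (text only)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    template = (
        "ALTER KEYSPACE {ks} WITH REPLICATION = "
        "{{'class' : 'NetworkTopologyStrategy', '{dc}' : {rf}}};"
    )
    return [
        template.format(ks=ks, dc=datacenter, rf=replication_factor)
        for ks in keyspaces
    ]


# input_data_profile is from the thread; the other two names are made up.
example_keyspaces = ["input_data_profile", "example_keyspace_a", "example_keyspace_b"]

for statement in alter_keyspace_statements(example_keyspaces, "DC1", 2):
    print(statement)
```

The printed statements can be pasted into cqlsh one by one.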

--
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscribe@lists.datastax.com.

pava...@way2online.co.in

Nov 4, 2017, 3:43:20 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi,

I have 4 nodes in the same cluster, all in the same rack. Each node holds about 100 to 150 GiB of data, and I have 20 keyspaces with replication factor 1. I am getting only 830 TPS.
I need more transactions per second. What do I have to change to get that? There is one more problem with this cluster: one of the nodes often goes down. I don't know why the entire Cassandra cluster goes down because of one node. I know Cassandra is fault tolerant, but in my scenario it does not deliver zero downtime.

So please help me resolve this.

Thanks and regards
pavs


Hemendra kumar

Nov 4, 2017, 9:21:34 AM
to java-dri...@lists.datastax.com
Hi Pavs,

I would recommend increasing your replication factor. Using a replication factor of one decreases performance, and the coordinator needs to wait for a response from all 4 nodes.

Thanks,
Hemendra


Jacques-Henri Berthemet

Nov 6, 2017, 9:58:24 AM
to java-dri...@lists.datastax.com

Just to be clear, it's because you have replication factor (RF) 1 that you can't have a node down. If you want to tolerate one node down you need RF 3; if you want to tolerate 2 nodes down you need RF 5.

 

Here is a site to calculate RF: https://www.ecyrd.com/cassandracalculator/
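
The RF 3 and RF 5 figures above match what you get if reads and writes use QUORUM consistency, which is an assumption here since the thread does not state the consistency level in use. Under that assumption a quorum is floor(RF / 2) + 1 replicas, and the number of replicas that can be down is RF minus the quorum. A small Python sketch of that arithmetic (the function names are made up for illustration):

```python
def quorum(replication_factor):
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor, required_responses):
    """Replicas that may be down while the request can still succeed."""
    return replication_factor - required_responses


for rf in (1, 2, 3, 5):
    needed = quorum(rf)
    spare = tolerated_replica_failures(rf, needed)
    print("RF", rf, "quorum", needed, "replicas that can be down", spare)
```

With consistency level ONE the required number of responses is 1 instead of the quorum, so the same subtraction gives a different answer for the same RF.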

 

 

 

--

Jacques-Henri Berthemet

Pavani T

Nov 7, 2017, 5:19:12 AM
to java-dri...@lists.datastax.com
Thank you @Hemendra kumar, @Jacques-Henri Berthemet.

But I want to know the reason the node goes down. When I check the Cassandra debug.log, it seems to be a read timeout error, and the fetch limit seems to be too high. I am fetching just 500 rows (280000 cells) per request; from what I have read, up to 5000 records should be possible. Moreover, I would like an alternative to increasing the replication factor that reduces the load size. So can you please suggest something else?

ERROR [Native-Transport-Requests-109] 2017-11-07 13:40:29,117 ErrorMessage.java:349 - Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:482) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:459) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:146) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.0.jar:3.9.0]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_144]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.0.jar:3.9.0]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_144]
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.RolesCache.getRoles(RolesCache.java:44) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:51) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:86) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.PermissionsCache.lambda$new$1(PermissionsCache.java:37) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.0.jar:3.9.0]
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
... 26 common frames omitted
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:489) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:269) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.RolesCache.lambda$new$197(RolesCache.java:36) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.0.jar:3.9.0]
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
... 40 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:497) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:485) ~[apache-cassandra-3.9.0.jar:3.9.0]
... 47 common frames omitted



Thanks and Regards
pavs


Pavani T

Nov 9, 2017, 5:47:35 AM
to java-dri...@lists.datastax.com
Hi

I am getting this error in the Cassandra debug.log file. If anybody knows how to fix it, please help me solve it.

ERROR [epollEventLoopGroup-2-51] 2017-11-09 16:09:21,495 Slf4JLogger.java:176 - LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.

Hemendra kumar

Nov 9, 2017, 5:57:39 AM
to java-dri...@lists.datastax.com
Hi Pav,

What is your server configuration and HEAP_SIZE setting for the Cassandra node? Please increase the heap size if it is not set in cassandra-env.sh.

Thanks,
Hemendra
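
For context on what "not set" means: when MAX_HEAP_SIZE is left unset, the stock cassandra-env.sh derives a default from system memory as max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB)), as far as I recall from the comments in that script. The Python sketch below only reproduces that arithmetic; the function name is hypothetical, and the result should be checked against the cassandra-env.sh shipped with your version.

```python
def default_max_heap_mb(system_memory_mb):
    """Default heap in MB: max(min(RAM/2, 1024), min(RAM/4, 8192))."""
    half_capped = min(system_memory_mb // 2, 1024)
    quarter_capped = min(system_memory_mb // 4, 8192)
    return max(half_capped, quarter_capped)


# Example RAM sizes in GB, chosen only for illustration.
for ram_gb in (2, 16, 125):
    heap_mb = default_max_heap_mb(ram_gb * 1024)
    print(ram_gb, "GB RAM gives a default heap of", heap_mb, "MB")
```

On machines with a lot of RAM the default therefore stops at 8 GB, so any larger heap has to be set explicitly.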


Pavani T

Nov 9, 2017, 7:45:50 AM
to java-dri...@lists.datastax.com
Hi Hemendra,

I am using the G1 garbage collector instead of the CMS collector.

I have 4 servers:

 x.x.x.1 contains-------------------------------------------

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2000M"


OS: CentOS - 7
RAM: 142 GB
Swap: 3 GB
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Core: 40 Core
Disk: 2.5T 



 x.x.x.2 contains--------------------------------------------

MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4000M"

OS: CentOS - 7
RAM: 125 GB
Swap: 3 GB
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Core: 40 Core
Disk: 2.2T


 x.x.x.3 contains---------------------------------------------

MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4000M"

OS: CentOS - 7
RAM: 125 GB
Swap: 3 GB
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Core: 40 Core
Disk: 2 TB

 x.x.x.4 contains-----------------------------------------

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1200M"

OS: CentOS - 7
RAM: 125 GB
Swap: 3 GB
Processor: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Core: 12 Core
Disk: 2.7 TB


The JVM options are:

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M


But the thing is, the 2nd server is the one that always goes down.

Hemendra kumar

Nov 10, 2017, 8:40:51 AM
to java-dri...@lists.datastax.com
Hi Pav,

Swapping is not recommended for Cassandra and may cause issues. Please disable swap and verify.

Thanks,
Hemendra


Pavani T

Nov 10, 2017, 9:35:33 AM
to java-dri...@lists.datastax.com
But those servers are not dedicated to Cassandra.


pava...@way2online.co.in

Dec 20, 2017, 5:15:38 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi Hemendra,

I researched a lot on how to disable swap for Cassandra only, without disabling it system-wide, but I didn't find any procedure for it. Can you tell me how to disable swap?

Thanks and Regards
pavs 

Hemendra kumar

Dec 21, 2017, 5:20:09 AM
to java-dri...@lists.datastax.com
Hi Pav,

You can use the swapoff -a command on your Linux machine to disable swap temporarily, until the next reboot.

Thanks,
Hemendra
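
Note that swapoff -a applies to the whole machine, not to Cassandra alone. One way to confirm whether swap is currently active on Linux is to look at the SwapTotal line in /proc/meminfo, which reads 0 kB when no swap is enabled. A minimal Python sketch of that check follows; the function names are made up for illustration, and it only reads the file without changing any setting.

```python
def swap_total_kib(meminfo_text):
    """Return the SwapTotal value in KiB, or None if the line is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       3145724 kB"
            return int(line.split()[1])
    return None


def swap_enabled(meminfo_text):
    """True when /proc/meminfo reports a non-zero amount of swap."""
    total = swap_total_kib(meminfo_text)
    return total is not None and total > 0


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            print("swap enabled:", swap_enabled(handle.read()))
    except OSError:
        # /proc/meminfo exists on Linux only.
        print("could not read /proc/meminfo")
```

Running it before and after swapoff -a should show the value change from True to False.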


Gintellect Intellect

Oct 10, 2019, 4:47:18 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi Pavs,
It would be good to know whether the "Operation timed out" issue was fixed by turning off swap.

Thanks


hemendra kumar

Oct 10, 2019, 10:46:13 AM
to java-dri...@lists.datastax.com
Hi Pavs,

Don't get frustrated; please answer the questions below:

What is your replication factor for all keyspaces?

Are you using NetworkTopologyStrategy or SimpleStrategy for replication?

Can you confirm your current cluster size?

Can you please share your contact number? I will call you and help you fix this.

After adding the new node to DC1, have you run cleanup on the first node?

Thanks,
Hemendra
