Too many handshake retries lead to Cassandra server hang


ZieNie Khan

Feb 5, 2015, 1:25:51 AM
to java-dri...@lists.datastax.com


I'm running a Cassandra cluster with 24 nodes.
The Cassandra server version is 2.1.2.

Yesterday one node went down because of a hardware fault,
and about 10 hours later all servers hung with "Too many open files".

In my Cassandra system.log there was a huge number of very frequent handshake attempts.
X.X.X.X is the IP of the server that went down with the hardware fault.
In the end this led to "Too many open files" and the servers hung.
I think it is probably caused by unclosed client sockets.

Is this a Cassandra bug, or a mistake in my config file or in how I use it?

Can someone help me?




INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,463 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,463 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,463 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,463 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,463 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,464 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,464 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,464 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,464 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,464 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,465 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,466 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,466 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,466 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:16,466 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
....
..... these lines repeat millions of times, and at last
.....

INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,469 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,470 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,471 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,472 OutboundTcpConnection.java:429 - Handshaking version with /X.X.X.X
INFO  [HANDSHAKE-/X.X.X.X] 2015-02-05 04:56:22,472 OutboundTcpConnection.java:438 - Cannot handshake version with /X.X.X.X
WARN  [SharedPool-Worker-25] 2015-02-05 04:56:22,525 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-25,5,main]: {}
java.lang.RuntimeException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/dev1/lib/cassandra2/data/user1/user_key-0febb330962c11e4b3d39dcaec8ca56f/user1-user_key-ka-4488-Data.db (Too many open files)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2084) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_67]
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.2.jar:2.1.2]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]



Here is the ulimit output.

[de...@csdr001.u1 logs]$  ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385578
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
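
One way to test the unclosed-socket theory would be to look at what the Cassandra process actually holds open. The following is a minimal sketch, assuming a Linux host with /proc available; the function names classify_fd_target and summarize_fds are hypothetical and not part of Cassandra or any driver.

```python
import os
from collections import Counter

DELETED_SUFFIX = " (deleted)"

def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.endswith(DELETED_SUFFIX):
        # The kernel appends this marker when the file was unlinked.
        target = target[: -len(DELETED_SUFFIX)]
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    if target.endswith(".db"):
        return "sstable_component"
    return "other_file"

def summarize_fds(pid):
    """Count the open descriptors of a process by category (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        counts[classify_fd_target(target)] += 1
    return counts
```

Calling summarize_fds with the Cassandra PID (as the same user or as root) a few times while the handshake loop is running should show whether the "socket" count keeps growing. The limit that applies to the running process is listed in /proc/<pid>/limits and may differ from what ulimit -a prints in a login shell.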



Mohammed Guller

Feb 5, 2015, 5:52:47 PM
to java-dri...@lists.datastax.com

Do you have an application that is only writing data but not reading anything from C*?

 

I suspect that minor compaction is not getting triggered. So over a period of time, C* will end up with several thousand SSTable files. That is why you are getting the “Too many open files” errors.
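
To check this hypothesis, the SSTables per table can be counted on disk. This is a rough sketch, assuming the 2.1 on-disk layout visible in the stack trace above (data/<keyspace>/<table>-<id>/...-Data.db); count_sstables is a hypothetical helper name.

```python
import os
from collections import Counter

def count_sstables(data_dir):
    """Count *-Data.db files per keyspace/table directory under data_dir."""
    counts = Counter()
    for root, dirs, files in os.walk(data_dir):
        # Snapshot and backup directories hold hard links to existing
        # SSTables, so skip them to avoid counting the same data twice.
        dirs[:] = [d for d in dirs if d not in ("snapshots", "backups")]
        n = sum(1 for f in files if f.endswith("-Data.db"))
        if n:
            counts[os.path.relpath(root, data_dir)] += n
    return counts

if __name__ == "__main__":
    # Path taken from the stack trace in this thread; adjust for your install.
    for table_dir, n in count_sstables("/home/dev1/lib/cassandra2/data").most_common(10):
        print(n, table_dir)
```

A table directory with thousands of Data.db files would support the compaction explanation; a few hundred or fewer would point elsewhere.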

 

You need to set 'cold_reads_to_omit' to 0 for all the CFs. Here is the complete statement:

 

ALTER TABLE cfName WITH compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'cold_reads_to_omit': 0.0};

 

Replace cfName with the name of your table and do this for every table in your keyspace. This should trigger minor compaction, and the SSTable count should go down to a few hundred files (depending on how much data you have).
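
Since the statement has to be repeated for every table, it could be generated rather than typed by hand. This is only a sketch: it builds the CQL text and does not execute it. The keyspace user1 and table user_key come from the stack trace earlier in the thread; build_alter_statements is a hypothetical helper name.

```python
# Same compaction options as the statement above, with placeholders
# for the keyspace and table names.
COMPACTION_TEMPLATE = (
    "ALTER TABLE {keyspace}.{table} WITH compaction = {{"
    "'min_threshold': '4', "
    "'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', "
    "'max_threshold': '32', "
    "'cold_reads_to_omit': 0.0}};"
)

def build_alter_statements(keyspace, tables):
    """Return one ALTER TABLE statement per table name."""
    return [COMPACTION_TEMPLATE.format(keyspace=keyspace, table=t) for t in tables]

if __name__ == "__main__":
    for stmt in build_alter_statements("user1", ["user_key"]):
        print(stmt)
```

The printed statements can be pasted into cqlsh after reviewing them.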

 

Mohammed


ZieNie Khan

Feb 6, 2015, 3:11:20 AM
to java-dri...@lists.datastax.com

Thanks for your reply, Mohammed Guller.


Do you have an application that is only writing data but not reading anything from C*? 

> No. At peak time the average read rate is 15,0000/sec and the average write rate is 30,000/sec.

 

I suspect that minor compaction is not getting triggered. So over a period of time, C* will end up with several thousand SSTable files.

That is why you are getting the “Too many open files” errors.

> I'm using SizeTieredCompactionStrategy, and the data directory has fewer than 1,000 data files (*.db).

> I think the handshake retries are exhausting the file descriptors.

> Under normal conditions there are only a few retries, around 3 to 5.

> But this time the retry count was over 10,000 by the time the 5-second timeout was reached.
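
The retry rate can be measured directly from system.log instead of estimated. The following is a small sketch, assuming the log line format shown in the first message; handshake_attempts_per_second is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches only the "Handshaking version with" lines, i.e. one per attempt.
HANDSHAKE_RE = re.compile(
    r"^INFO\s+\[HANDSHAKE-(?P<peer>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"OutboundTcpConnection\.java:\d+ - Handshaking version with "
)

def handshake_attempts_per_second(lines):
    """Count handshake attempts per (peer, second) from system.log lines."""
    counts = Counter()
    for line in lines:
        m = HANDSHAKE_RE.match(line)
        if m:
            counts[(m.group("peer"), m.group("ts"))] += 1
    return counts

if __name__ == "__main__":
    # The log file location is an assumption; use the path of your system.log.
    with open("system.log") as f:
        for (peer, ts), n in handshake_attempts_per_second(f).most_common(5):
            print(ts, peer, n)
```

Comparing the busiest seconds with a normal period (a handful of attempts) would show how far the retry loop has run away.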


 









--
If you smile... reasons to smile will come.