Cassandra node cannot restart, always OOM


Sylvanas

Jan 23, 2018, 8:57:24 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi all,

I get an OOM exception when I try to restart my Cassandra nodes. My cluster has 6 nodes on 6 virtual machines. It had been running well until three machines went down. Now I cannot start any of them because of the OOM exception.
Each virtual machine has 16 cores and 32 GB of memory. I set the Java heap size to 8 GB, and even after raising it to 16 GB the node still ran out of memory. I also tried shutting down the remaining nodes and restarting them, but they hit OOM too.
I am using Cassandra 3.11.0. The full debug log and config file are attached; the GC parameters have never been modified.

The summary error log is as follows:
----------------------------------------
INFO  [Service Thread] 2018-01-20 10:36:21,996 GCInspector.java:284 - ParNew GC in 206ms.  CMS Old Gen: 731100272 -> 866123088; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:36:44,170 GCInspector.java:284 - ParNew GC in 204ms.  CMS Old Gen: 1105806808 -> 1242627824; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:36:50,842 GCInspector.java:284 - ParNew GC in 226ms.  CMS Old Gen: 1242627824 -> 1366226864; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:37:21,769 GCInspector.java:284 - ParNew GC in 247ms.  CMS Old Gen: 1745880392 -> 1892560528; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:37:28,701 GCInspector.java:284 - ParNew GC in 289ms.  CMS Old Gen: 1892560528 -> 2019205992; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:38:29,273 GCInspector.java:284 - ParNew GC in 417ms.  CMS Old Gen: 2796078216 -> 2935838968; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:38:37,115 GCInspector.java:284 - ParNew GC in 381ms.  CMS Old Gen: 2935838968 -> 3062603768; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:40:12,620 GCInspector.java:284 - ParNew GC in 519ms.  CMS Old Gen: 4200087496 -> 4367631944; Par Eden Space: 671088640 -> 0; 
INFO  [Service Thread] 2018-01-20 10:41:57,564 GCInspector.java:284 - ConcurrentMarkSweep GC in 254ms.  CMS Old Gen: 5898093904 -> 5897056632; Par Eden Space: 13398000 -> 200870992; 
INFO  [Service Thread] 2018-01-20 10:42:17,650 GCInspector.java:284 - ParNew GC in 571ms.  CMS Old Gen: 6172641376 -> 6344653688; Par Eden Space: 671088640 -> 0; 
WARN  [Service Thread] 2018-01-20 10:44:04,017 GCInspector.java:282 - ConcurrentMarkSweep GC in 13173ms.  CMS Old Gen: 7682874152 -> 7751073456; Par Eden Space: 671088640 -> 141499144; Par Survivor Space: 83886080 -> 0
WARN  [Service Thread] 2018-01-20 10:44:28,713 GCInspector.java:282 - ConcurrentMarkSweep GC in 11924ms.  CMS Old Gen: 7751072680 -> 7751073736; Par Eden Space: 671088640 -> 261939896; Par Survivor Space: 83886016 -> 0
WARN  [Service Thread] 2018-01-20 10:57:21,377 GCInspector.java:282 - ConcurrentMarkSweep GC in 121991ms.  Par Eden Space: 671088640 -> 671088336; Par Survivor Space: 83886080 -> 82360480
INFO  [Service Thread] 2018-01-20 10:57:21,377 StatusLogger.java:47 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-01-20 10:57:21,377 NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with average duration of 10005.27ms, 1 have exceeded the configured commit interval by an average of 5.27ms
WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-01-20 11:02:15,079 NoSpamLogger.java:94 - Out of 13 commit log syncs over the past 225.99s with average duration of 10140.66ms, 5 have exceeded the configured commit interval by an average of 9342.16ms
WARN  [Service Thread] 2018-01-20 11:24:34,098 GCInspector.java:282 - ConcurrentMarkSweep GC in 1573676ms.  CMS Old Gen: 7751073616 -> 7751073568; Par Eden Space: 671088640 -> 671088336; Par Survivor Space: 83886064 -> 82542792
ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2018-01-20 11:24:34,117 CassandraDaemon.java:228 - Exception in thread Thread[PERIODIC-COMMIT-LOG-SYNCER,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [main] 2018-01-20 11:24:34,116 CassandraDaemon.java:706 - Exception encountered during startup
java.lang.OutOfMemoryError: Java heap space
ERROR [OptionalTasks:1] 2018-01-20 11:24:34,116 CassandraDaemon.java:228 - Exception in thread Thread[OptionalTasks:1,5,main]
java.lang.OutOfMemoryError: Java heap space
----------------------------------------
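
For anyone reading along, here is a minimal sketch of how the GCInspector lines above could be summarized to see whether CMS cycles actually free any old-generation memory. The function names (`old_gen_samples`, `reclaimed_by_cms`) are made up for illustration, and the regular expression only assumes the line layout shown in this excerpt.

```python
import re

# Matches "<collector> GC in <N>ms. ... CMS Old Gen: <before> -> <after>"
# as it appears in the GCInspector lines quoted in this thread.
OLD_GEN = re.compile(
    r"(ParNew|ConcurrentMarkSweep) GC in (\d+)ms\..*?CMS Old Gen: (\d+) -> (\d+)"
)

def old_gen_samples(lines):
    """Return (collector, pause_ms, before_bytes, after_bytes) for each matching line."""
    samples = []
    for line in lines:
        m = OLD_GEN.search(line)
        if m:
            samples.append((m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))))
    return samples

def reclaimed_by_cms(samples):
    """Bytes freed by each ConcurrentMarkSweep cycle; negative means old gen grew."""
    return [before - after
            for collector, _pause, before, after in samples
            if collector == "ConcurrentMarkSweep"]

# Two CMS lines copied from the log excerpt above.
log = [
    "INFO  [Service Thread] 2018-01-20 10:41:57,564 GCInspector.java:284 - ConcurrentMarkSweep GC in 254ms.  CMS Old Gen: 5898093904 -> 5897056632; Par Eden Space: 13398000 -> 200870992; ",
    "WARN  [Service Thread] 2018-01-20 10:44:04,017 GCInspector.java:282 - ConcurrentMarkSweep GC in 13173ms.  CMS Old Gen: 7682874152 -> 7751073456; Par Eden Space: 671088640 -> 141499144; Par Survivor Space: 83886080 -> 0",
]

for freed in reclaimed_by_cms(old_gen_samples(log)):
    print(f"CMS cycle freed {freed} bytes")
```

On these two lines the first cycle frees about 1 MB out of roughly 5.9 GB and the second frees nothing. That pattern is usually a sign that the old generation is full of live objects rather than garbage, so a larger heap alone may not help.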

Does anyone know what is happening and how to fix it?

Thanks very much!
debug.log
cassandra.yaml

Nicolas Guyomar

Jan 24, 2018, 5:30:22 AM
to java-dri...@lists.datastax.com
Hi Sylvanas,

Your cassandra.yaml is far from the default one; you changed a lot of parameters. I would try starting with a default configuration, just to check that you haven't changed something critical by mistake.

Could you also share your jvm.options file so the community can help you?

You seem to have a lot of SSTables on disk. Were you having trouble with compaction not keeping up with your write throughput?


Sylvanas

Jan 25, 2018, 5:34:46 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi Nicolas,

Thank you for your reply. I have finally solved the problem.

I found about 4 GB of data for the system.prepared_statements table in the data directory, which seems abnormal. After I removed that data, the node started successfully and the GC messages no longer appeared in debug.log. I then removed the system.prepared_statements data on each node, and the entire cluster finally came up. The other tables and their data appear to still be there, but I'm not sure whether a small amount of data was lost.
I still do not know how the data in system.prepared_statements was generated, how it caused the OOM at startup, or what the impact of removing it is.
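
In case it helps someone hitting the same symptom, here is a rough sketch of how the per-table on-disk size could be checked to spot an unexpectedly large table such as system.prepared_statements. It assumes the usual `<data_dir>/<keyspace>/<table>-<id>/` directory layout; the path `/var/lib/cassandra/data` in the usage comment is only an assumption, so use whatever `data_file_directories` points to in your cassandra.yaml. The function names are made up for illustration.

```python
import os

def table_sizes(data_dir):
    """Map 'keyspace/table-dir' to the total size in bytes of all files under it.

    Snapshot and backup subdirectories are included in the total.
    """
    sizes = {}
    for keyspace in sorted(os.listdir(data_dir)):
        ks_path = os.path.join(data_dir, keyspace)
        if not os.path.isdir(ks_path):
            continue
        for table in sorted(os.listdir(ks_path)):
            table_path = os.path.join(ks_path, table)
            if not os.path.isdir(table_path):
                continue
            total = 0
            for root, _dirs, files in os.walk(table_path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            sizes[f"{keyspace}/{table}"] = total
    return sizes

def largest(sizes, n=5):
    """Return the n largest (name, bytes) pairs, biggest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Example usage (path is an assumption, adjust to your installation):
# for name, size in largest(table_sizes("/var/lib/cassandra/data")):
#     print(f"{size / 1024**3:8.2f} GiB  {name}")
```

A system table that takes gigabytes on disk stands out immediately in this listing, since system tables are normally small.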

I have not modified the GC parameters in jvm.options; they are the default configuration.

At the beginning, we had some tables using LeveledCompactionStrategy, and we did see a lot of compaction tasks. We then switched them to SizeTieredCompactionStrategy, and now there is no compaction pressure.


On Tuesday, January 23, 2018 at 9:57:24 PM UTC+8, Sylvanas wrote: