I have a keyspace with a small data set that nonetheless serves some heavy-duty search queries against a deeply nested index definition. I've migrated this keyspace from a cluster running Elassandra 5.5.0.24 to one running 6.8.4.7. The number of nodes in the search datacenter and their specifications are the same in both clusters, and the index definitions haven't changed. The one other difference is that this small keyspace has to share the old cluster with a couple of keyspaces whose data sets are many times larger, whereas it has the new cluster all to itself.
So the only substantial difference, to the best of my knowledge, is the Elassandra version. But here's where the behavior differs:
The Elassandra search nodes on the new cluster keep crashing. The crashing is always about running out of memory, and typically about running out of heap space. The vanilla Cassandra nodes never crash on either cluster.
Here's what the heap usage looks like shortly after I've restarted some of the nodes:
# nodetool -u dba_admin -pw dba_admin -h west-search-01.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 1786.30 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-02.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 1509.38 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-03.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 1718.96 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-04.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 1299.82 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-05.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 921.51 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-06.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 4570.36 / 31584.00
Here's what the heap usage looks like after letting some of the nodes just run for several hours:
# nodetool -u dba_admin -pw dba_admin -h west-search-01.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 18099.37 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-02.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 1840.71 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-03.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 16959.97 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-04.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 16003.95 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-05.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 17072.32 / 31584.00
# nodetool -u dba_admin -pw dba_admin -h west-search-06.foo.bar info|grep "^Heap Memory (MB)"
Heap Memory (MB) : 5375.93 / 31584.00
On the first node, note the increase from 1786.30 MB to 18099.37 MB. The typical pattern we've seen is that the heap usage just gradually climbs, past 50%, past 70%, until the server crashes. I haven't seen it come back down except when the server crashes. We've produced some graphs of the heap usage over time, but I haven't analyzed them closely.
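For reference, here's a minimal sketch of the kind of loop that could collect those readings over time for graphing (the hostnames and nodetool invocation are the same as above; the output path is just an example):

#!/bin/bash
# Append a timestamped heap reading (used / max, in MB) for each
# search node to a CSV once a minute.
NODES="west-search-01 west-search-02 west-search-03 west-search-04 west-search-05 west-search-06"
OUT=/tmp/heap-usage.csv
while true; do
  for n in $NODES; do
    heap=$(nodetool -u dba_admin -pw dba_admin -h "$n.foo.bar" info | grep "^Heap Memory (MB)" | awk -F': ' '{print $2}')
    echo "$(date -u +%FT%TZ),$n,$heap" >> "$OUT"
  done
  sleep 60
done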
The overall RAM utilization has typically been very high, around 90%. On one occasion the RAM spiked to 100%, and at the same time the heap usage dropped sharply and then spiked to 100% as well before the node crashed, but that incident was atypical.
Here's how the java command is being invoked, with the classpath elided:
java -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -Xms31G -Xmx31G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Des.search_strategy_class=RackAwareSearchStrategy -Dcom.sun.management.jmxremote.access.file=/usr/lib/jvm/zulu-8/jre/lib/management/jmxremote.access -javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.3.1.jar=9192:/opt/prometheus/elassandra-grafana.yml -Xmn1600M -XX:+UseCondCardMark -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -Dcassandra.jmx.remote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Djdk.io.permissionsUseCanonicalPath=true -Des.distribution.flavor=oss -Des.distribution.type=rpm -Djava.awt.headless=true -XX:OnOutOfMemoryError=kill -9 %p -Dlogback.configurationFile=/etc/cassandra/logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=/var/lib/cassandra -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp ... org.apache.cassandra.service.ElassandraDaemon
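Since the GC log is already being written with -XX:+PrintGCDetails, one quick sanity check is whether CMS cycles and full GCs are actually reclaiming the old generation, or whether concurrent mode failures are showing up. A rough sketch (the grep strings are the usual JDK 8 CMS log markers; paths as in the command line above):

#!/bin/bash
# Count CMS cycles, full GCs, and concurrent mode failures across the
# rotated GC logs, then show the most recent full-GC lines.
LOGS=/var/log/cassandra/gc.log*
echo "CMS cycles started:       $(grep -h 'CMS-initial-mark' $LOGS | wc -l)"
echo "Full GCs:                 $(grep -h 'Full GC' $LOGS | wc -l)"
echo "Concurrent mode failures: $(grep -h 'concurrent mode failure' $LOGS | wc -l)"
grep -h 'Full GC' $LOGS | tail -5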
Note that I've temporarily enabled HeapDumpOnOutOfMemoryError, although I haven't checked the output of that yet.
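In the meantime, a live class histogram is a much cheaper first look than a full heap dump, and both can be taken on demand with jmap from the JDK (just a sketch; the pid lookup is an example):

#!/bin/bash
# Quick look at what's filling the heap on a running node.
# Note: the "live" option forces a full GC before reporting.
PID=$(pgrep -f ElassandraDaemon)
jmap -histo:live "$PID" | head -40
# For a full dump to load into Eclipse MAT or VisualVM:
# jmap -dump:live,format=b,file=/var/tmp/elassandra-heap.hprof "$PID"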
Thanks in advance. I'll post an update if I can figure this out on my end.
P.S. Under other circumstances I would be asking this question under a commercial Elassandra support contract. I'd like to position my team to purchase one, but we'd be in a better position to do so if we can get past this blocker.