Reaper run blocks Cassandra node


Cristian Zamfirache

Nov 26, 2019, 8:33:42 AM
to TLP Apache Cassandra Reaper users
Hi all,

I'm having an issue with a Reaper run, and I'm not sure whether it is a bug or a configuration issue.

High level description:
*************************
- The initial run, with (to my knowledge) the same configuration, completed successfully, although with some errors in the logs.
- The second run started and had finished about 70% of the total segments when we began to notice:
   - high read/write latency on the Reaper tables (leader and repair_run)
   - an increasing number of pending compactions (100+ per Cassandra node), despite messages in the Reaper logs saying that new repair operations were rejected/postponed for this very reason
   - finally, the Cassandra node defined as the contact point for Reaper became completely blocked (connection pool shutdown) and the Cassandra service eventually went down with an OOM.


A couple more details:
*************************

Reaper version: 1.4.7
Cassandra version: 3.11.2
Connectivity to the Cassandra cluster is fine on both port 9042 and port 7199.

Part of the Reaper config file:
********************************

segmentCountPerNode: 16
repairParallelism: PARALLEL
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 30
hangingRepairTimeoutMins: 120
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
blacklistTwcsTables: false
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 30
useAddressTranslator: false
purgeRecordsAfterInDays: 0
numberOfRunsToKeepPerUnit: 50

datacenterAvailability: ALL

server:
  type: default
  applicationConnectors:
    - type: http
      port: 10080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 10081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []


cassandra:
  clusterName: "cass_3112"
  contactPoints: [10.100.200.126]
  port: 9042
  keyspace: reaper_db
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC:
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false
  authProvider:
    type: plainText
    username: cassandra
    password: xxxxxxxxx
autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT1M
  excludedKeyspaces:
    - reaper_db




A couple of Reaper errors, in the order in which they appeared in the logs:
************************************************************************************

---------- This is the error that also appeared in the first repair run, which completed fine ----------
---------------------------------------------------------------------------------------------------------

ERROR  [cass_3112:9ff2acd0-0d1f-11ea-a32c-db1a136dd27e:9ff2faf6-0d1f-11ea-a32c-db1a136dd27e] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: Could not release lead on segment 9ff2faf6-0d1f-11ea-a32c-db1a136dd27e
        at io.cassandrareaper.storage.CassandraStorage.releaseLead(CassandraStorage.java:1271)
        at io.cassandrareaper.service.SegmentRunner.releaseLead(SegmentRunner.java:1150)
        at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:146)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

---------- The error from the moment the Cassandra node became unreachable ----------
---------------------------------------------------------------------------------------

ERROR  [cass_3112-worker-623] c.d.d.c.RequestHandler - Unexpected error while querying /10.100.200.126
com.datastax.driver.core.exceptions.ConnectionException: [/10.100.200.126:9042] Pool is shutdown
        at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:613)
        at com.datastax.driver.core.SessionManager.removePool(SessionManager.java:400)
        at com.datastax.driver.core.SessionManager.onDown(SessionManager.java:485)
        at com.datastax.driver.core.Cluster$Manager.onDown(Cluster.java:1941)
        at com.datastax.driver.core.Cluster$Manager.access$1200(Cluster.java:1385)
        at com.datastax.driver.core.Cluster$Manager$5.runMayThrow(Cluster.java:1898)
        at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)


---------- A couple of messages that appeared later in the logs, before the Cassandra process crashed ----------
------------------------------------------------------------------------------------------------------------------

INFO   [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Adding new JMX Proxy for host 10.100.200.126:7199

ERROR  [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Failed creating a new JMX connection to 10.100.200.126:7199
java.lang.RuntimeException: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
        at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:233)
        at io.cassandrareaper.jmx.JmxConnectionFactory.connectImpl(JmxConnectionFactory.java:108)
        at io.cassandrareaper.jmx.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:142)
        at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:233)
        at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:144)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:255)
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:161)
        at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:227)
        ... 16 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at io.cassandrareaper.jmx.JmxProxyImpl.connectWithTimeout(JmxProxyImpl.java:273)
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:214)
        ... 18 common frames omitted
INFO   [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Unreachable host: Failure when establishing JMX connection to 10.100.200.126:7199: null
 


Thank you in advance for any hint that may help me identify the root cause and a possible fix/workaround.

 

Alexander Dejanovski

Nov 26, 2019, 9:29:07 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users
Hi Cristian,

Did you notice which tables had a lot of pending compactions?
You could be using a number of segments that is too low, which generates a lot of overstreaming and large Merkle trees. Cassandra 3.11.x is known for generating heavy Merkle trees that can lead to OOMs. This was fixed in 3.11.5 with a new setting to control Merkle tree depth: https://issues.apache.org/jira/browse/CASSANDRA-14096
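
If I remember correctly, the setting introduced by that ticket is repair_session_space_in_mb in cassandra.yaml; please double-check the exact name against the 3.11.5 documentation, but it should look roughly like this:

# cassandra.yaml on a 3.11.5+ node (sketch; verify the option name for your version).
# It caps the memory a single repair session may use for Merkle trees; the default
# is derived from the heap size, so only set it explicitly if you need to lower it.
repair_session_space_in_mb: 128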

This could very well be happening for specific segments if they cover token ranges that have many or large partitions. How much data do you have per node, and how many nodes are there in the cluster?
Try doubling the number of segments per node as a first step and see how that goes.
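
In your reaper.yaml that would simply be (a sketch based on the config you posted; 32 is just double your current value, and you may need to go higher):

# reaper.yaml -- double the current value of 16 as a first step
segmentCountPerNode: 32

As far as I know this only applies to repair runs created after the change, so you'll need to start a new run to see the effect.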

FTR, the contact point is only used to bootstrap the connection of the DataStax Java Driver; after the initial connection, that node gets no special treatment compared to the rest of the cluster.

-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting


--
You received this message because you are subscribed to the Google Groups "TLP Apache Cassandra Reaper users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlp-apache-cassandra-r...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlp-apache-cassandra-reaper-users/d15d9e15-0a33-4701-9d01-87591186c0e0%40googlegroups.com.

Cristian Zamfirache

Nov 26, 2019, 10:21:56 AM
to TLP Apache Cassandra Reaper users
Hi Alexander,

Thank you for the answer; your suggestion makes a lot of sense and I'll try to check it as quickly as possible (it will take a bit, as we are currently fixing the node after it was down for an extended time).

Regarding your questions: yes, what we noticed matches your assumptions.
The "culprit" seems to be a table of around 30 GB, with the TWCS compaction strategy and around 100 SSTables.
Partition sizes are reasonable: 99% under 50 KB, with a maximum of around 2 MB.
The cluster consists of 10 nodes, each with around 300 GB of data.

One more question that puzzles me (before I come back with the results of the suggested test): why would the number of pending compactions increase so much (>100)?
In my understanding, Reaper stops submitting repairs for new segments when it notices a high number of pending compactions.
And there are indeed messages in the logs to that effect:
INFO   [cass_3112:a2016890-0d1f-11ea-a32c-db1a136dd27e:a201ddc7-0d1f-11ea-a32c-db1a136dd27e] i.c.s.SegmentRunner - SegmentRunner declined to repair segment a201ddc7-0d1f-11ea-a32c-db1a136dd27e because of too many pending compactions (> 20) on host "10.100.200.126"

Thank you,
Cristian

Alexander Dejanovski

Nov 26, 2019, 11:08:36 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users
Reaper will stop submitting segments when there are lots of pending compactions, but it won't stop the segments that are already running.
A single segment could very well trigger the streaming of a lot of tiny SSTables, which will create a lot of pending compactions.

We usually advise against repairing TWCS tables, precisely because of the behavior you're observing.
You'll want to limit overstreaming though, by using more segments.
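
In your reaper.yaml that would roughly look like this (a sketch based on the config you posted):

# reaper.yaml -- let Reaper skip TWCS/DTCS tables automatically and
# use more (smaller) segments to limit overstreaming
blacklistTwcsTables: true
segmentCountPerNode: 32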




Reid Pinchback

Dec 2, 2019, 11:33:56 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users

Cristian, we had the exact same thing happen with 3.11.4. From heap dumps we could tell that nodes were being overwhelmed by holding Merkle trees in memory. For us the solution was to back-patch some fixes that were made for 4.0. I think those same patches are now in 3.11.5. You might make this problem go away just by upgrading.


Cristian Zamfirache

Dec 17, 2019, 4:47:47 AM
to TLP Apache Cassandra Reaper users
Hi all,

We temporarily fixed the issue by decreasing the segment size (i.e. increasing the number of segments) and excluding the TWCS tables from the repair.
We are also planning an upgrade to 3.11.5 in order to eliminate the root cause.

Really great and timely support; it saved us from a pickle.

Thank you all!
Cristian