Reaper run blocks Cassandra node


Cristian Zamfirache

Nov 26, 2019, 8:33:42 AM
to TLP Apache Cassandra Reaper users
Hi all,

I'm having an issue with a Reaper run, and I'm not sure whether it is a bug or a configuration issue.

High level description:
*************************
- The initial run, with (to my knowledge) the same configuration, completed successfully, although with some errors in the logs.
- The second run started and had finished about 70% of the total segments when we began to notice:
   - high read/write latency on the Reaper tables (leader and repair_run)
   - an increasing number of pending compactions (100+ per Cassandra node), despite messages in the Reaper logs saying that new repair operations were rejected/postponed for this very reason
   - finally, the Cassandra node defined as the contact point for Reaper became completely blocked (connection pool shutdown) and the Cassandra service eventually went down with an OOM.


A couple more details:
*************************

Reaper version: 1.4.7
Cassandra version: 3.11.2
Connectivity to the Cassandra cluster is fine on both port 9042 and port 7199.

Part of the Reaper config file:
********************************

segmentCountPerNode: 16
repairParallelism: PARALLEL
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 30
hangingRepairTimeoutMins: 120
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
blacklistTwcsTables: false
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 30
useAddressTranslator: false
purgeRecordsAfterInDays: 0
numberOfRunsToKeepPerUnit: 50

datacenterAvailability: ALL

server:
  type: default
  applicationConnectors:
    - type: http
      port: 10080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 10081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []


cassandra:
  clusterName: "cass_3112"
  contactPoints: [10.100.200.126]
  port: 9042
  keyspace: reaper_db
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC:
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false
  authProvider:
    type: plainText
    username: cassandra
    password: xxxxxxxxx
autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT1M
  excludedKeyspaces:
    - reaper_db




A couple of Reaper errors, in the order in which they appeared in the logs:
************************************************************************************

---------- This is the error that also appeared in the first repair run, which completed fine ----------
---------------------------------------------------------------------------------------------------------

ERROR  [cass_3112:9ff2acd0-0d1f-11ea-a32c-db1a136dd27e:9ff2faf6-0d1f-11ea-a32c-db1a136dd27e] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: Could not release lead on segment 9ff2faf6-0d1f-11ea-a32c-db1a136dd27e
        at io.cassandrareaper.storage.CassandraStorage.releaseLead(CassandraStorage.java:1271)
        at io.cassandrareaper.service.SegmentRunner.releaseLead(SegmentRunner.java:1150)
        at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:146)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

---------- The error from the moment the Cassandra node became unreachable ----------
---------------------------------------------------------------------------------------

ERROR  [cass_3112-worker-623] c.d.d.c.RequestHandler - Unexpected error while querying /10.100.200.126
com.datastax.driver.core.exceptions.ConnectionException: [/10.100.200.126:9042] Pool is shutdown
        at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:613)
        at com.datastax.driver.core.SessionManager.removePool(SessionManager.java:400)
        at com.datastax.driver.core.SessionManager.onDown(SessionManager.java:485)
        at com.datastax.driver.core.Cluster$Manager.onDown(Cluster.java:1941)
        at com.datastax.driver.core.Cluster$Manager.access$1200(Cluster.java:1385)
        at com.datastax.driver.core.Cluster$Manager$5.runMayThrow(Cluster.java:1898)
        at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)


---------- A couple of messages that appeared later in the logs, before the Cassandra process crashed ----------
------------------------------------------------------------------------------------------------------------------

INFO   [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Adding new JMX Proxy for host 10.100.200.126:7199

ERROR  [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Failed creating a new JMX connection to 10.100.200.126:7199
java.lang.RuntimeException: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
        at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:233)
        at io.cassandrareaper.jmx.JmxConnectionFactory.connectImpl(JmxConnectionFactory.java:108)
        at io.cassandrareaper.jmx.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:142)
        at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:233)
        at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:144)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:255)
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:161)
        at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:227)
        ... 16 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at io.cassandrareaper.jmx.JmxProxyImpl.connectWithTimeout(JmxProxyImpl.java:273)
        at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:214)
        ... 18 common frames omitted
INFO   [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Unreachable host: Failure when establishing JMX connection to 10.100.200.126:7199: null
 


Thank you in advance for any hint that may help me identify the root cause and a possible fix/workaround.

 

Alexander Dejanovski

Nov 26, 2019, 9:29:07 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users
Hi Cristian,

Did you notice which tables had a lot of pending compactions?
You could be using a number of segments that is too low, which generates a lot of overstreaming and large Merkle trees. Cassandra 3.11.x is known for generating heavy Merkle trees that can lead to OOMs. This was fixed in 3.11.5 with a new setting to control Merkle tree depth: https://issues.apache.org/jira/browse/CASSANDRA-14096
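
If I remember correctly, the setting introduced by that ticket is repair_session_space_in_mb in cassandra.yaml; please double-check the exact name against the 3.11.5 documentation, but it should look roughly like this:

# cassandra.yaml on a 3.11.5+ node (sketch; verify the option name for your version).
# It caps the memory a single repair session may use for Merkle trees; the default
# is derived from the heap size, so only set it explicitly if you need to lower it.
repair_session_space_in_mb: 128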

This could very well be happening for specific segments if they cover token ranges that have many or large partitions. How much data do you have per node, and how many nodes are there in the cluster?
Try doubling the number of segments per node as a first step and see how that goes.
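
In your reaper.yaml that would simply be (a sketch based on the config you posted; 32 is just double your current value, and you may need to go higher):

# reaper.yaml -- double the current value of 16 as a first step
segmentCountPerNode: 32

As far as I know this only applies to repair runs created after the change, so you'll need to start a new run to see the effect.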

FTR, the contact point is only used to bootstrap the connection of the DataStax Java Driver; after the initial connection, that node gets no special treatment compared to the rest of the cluster.

-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting


--
You received this message because you are subscribed to the Google Groups "TLP Apache Cassandra Reaper users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlp-apache-cassandra-r...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlp-apache-cassandra-reaper-users/d15d9e15-0a33-4701-9d01-87591186c0e0%40googlegroups.com.

Cristian Zamfirache

Nov 26, 2019, 10:21:56 AM
to TLP Apache Cassandra Reaper users
Hi Alexander,

Thank you for the answer; your suggestion makes a lot of sense and I'll try to check it as quickly as possible (it will take a bit, as we are currently fixing the node after it was down for an extended time).

Regarding your questions: yes, what we noticed matches your assumptions.
The "culprit" seems to be a table of around 30 GB, with the TWCS compaction strategy and around 100 SSTables.
Partition sizes are reasonable: 99% under 50 KB, with a maximum of around 2 MB.
The cluster consists of 10 nodes, each with around 300 GB of data.

One more question that puzzles me (before I come back with the results of the suggested test): why would the number of pending compactions increase so much (>100)?
In my understanding, Reaper stops submitting repairs for new segments when it notices a high number of pending compactions.
And there are indeed messages in the logs to that effect:
INFO   [cass_3112:a2016890-0d1f-11ea-a32c-db1a136dd27e:a201ddc7-0d1f-11ea-a32c-db1a136dd27e] i.c.s.SegmentRunner - SegmentRunner declined to repair segment a201ddc7-0d1f-11ea-a32c-db1a136dd27e because of too many pending compactions (> 20) on host "10.100.200.126"

Thank you,
Cristian

Alexander Dejanovski

Nov 26, 2019, 11:08:36 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users
Reaper will stop submitting segments when there are lots of pending compactions, but it won't stop the segments that are already running.
A single segment could very well trigger the streaming of a lot of tiny SSTables, which will create a lot of pending compactions.

We usually advise against repairing TWCS tables, precisely because of the behavior you're observing.
You'll want to limit overstreaming though, by using more segments.
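
In your reaper.yaml that would roughly look like this (a sketch based on the config you posted):

# reaper.yaml -- let Reaper skip TWCS/DTCS tables automatically and
# use more (smaller) segments to limit overstreaming
blacklistTwcsTables: true
segmentCountPerNode: 32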




Reid Pinchback

Dec 2, 2019, 11:33:56 AM
to Cristian Zamfirache, TLP Apache Cassandra Reaper users

Cristian, we had the exact same thing happen with 3.11.4. From heap dumps we could tell that nodes were being overwhelmed by holding Merkle trees in memory. For us the solution was to back-patch some fixes that were made for 4.0. I think those same patches are now in 3.11.5. You might make this problem go away just by upgrading.


Cristian Zamfirache

Dec 17, 2019, 4:47:47 AM
to TLP Apache Cassandra Reaper users
Hi all,

We temporarily fixed the issue by decreasing the segment size (i.e. increasing the number of segments) and excluding the TWCS tables from the repair.
We are also planning an upgrade to 3.11.5 in order to eliminate the root cause.

Really great and timely support; it saved us from a pickle.

Thank you all!
Cristian