Hi all,
I'm having an issue with a Reaper run and am not sure whether it is a bug or a configuration issue.
High level description:
*************************
- The initial run, with (to my knowledge) the same configuration, completed successfully, though with some errors in the logs.
- The second run had completed about 70% of the total segments when we started to notice:
- high read/write latency on the Reaper tables (leader and repair_run)
- an increasing number of pending compactions (up to 100+ per Cassandra node), even though the Reaper logs reported that new repair operations were being rejected/postponed for this reason
- finally, the Cassandra node defined as the contact point for Reaper became completely blocked (connection pool shutdown), and the Cassandra service shut down due to OOM.
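For anyone who wants to watch the same symptom, below is a minimal sketch of how the pending compaction count could be sampled per node. It assumes `nodetool` is on the PATH and that its `compactionstats` output starts with a `pending tasks: N` line; the helper names `parse_pending_tasks` and `pending_compactions` are made up for illustration and are not part of Reaper or Cassandra. If JMX authentication is enabled, the nodetool call would also need credentials.

```python
import re
import subprocess


def parse_pending_tasks(compactionstats_output: str) -> int:
    """Extract N from the 'pending tasks: N' line of `nodetool compactionstats`."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    if match is None:
        raise ValueError("no 'pending tasks' line found in nodetool output")
    return int(match.group(1))


def pending_compactions(host: str, jmx_port: int = 7199) -> int:
    """Run `nodetool compactionstats` against one node and return its pending count.

    Assumption: unauthenticated JMX on the given port.
    """
    result = subprocess.run(
        ["nodetool", "-h", host, "-p", str(jmx_port), "compactionstats"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pending_tasks(result.stdout)
```

Calling `pending_compactions("10.100.200.126")` in a loop with a sleep between samples would show whether the backlog keeps growing while Reaper reports postponed segments.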
A couple more details:
*************************
Reaper version: 1.4.7
Cassandra version: 3.11.2
Connectivity to the Cassandra cluster is fine on both ports 9042 and 7199.
Part of the Reaper config file:
********************************
segmentCountPerNode: 16
repairParallelism: PARALLEL
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 30
hangingRepairTimeoutMins: 120
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
blacklistTwcsTables: false
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 30
useAddressTranslator: false
purgeRecordsAfterInDays: 0
numberOfRunsToKeepPerUnit: 50
datacenterAvailability: ALL
server:
  type: default
  applicationConnectors:
    - type: http
      port: 10080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 10081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []
cassandra:
  clusterName: "cass_3112"
  contactPoints: [10.100.200.126]
  port: 9042
  keyspace: reaper_db
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC:
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false
  authProvider:
    type: plainText
    username: cassandra
    password: xxxxxxxxx
autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT1M
  excludedKeyspaces:
    - reaper_db
A couple of Reaper errors, in the order in which they appeared in the logs:
************************************************************************************
---------- This error also appeared in the first repair run, which completed fine ------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR [cass_3112:9ff2acd0-0d1f-11ea-a32c-db1a136dd27e:9ff2faf6-0d1f-11ea-a32c-db1a136dd27e] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: Could not release lead on segment 9ff2faf6-0d1f-11ea-a32c-db1a136dd27e
at io.cassandrareaper.storage.CassandraStorage.releaseLead(CassandraStorage.java:1271)
at io.cassandrareaper.service.SegmentRunner.releaseLead(SegmentRunner.java:1150)
at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:146)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
---------- The error from the moment the Cassandra node became unreachable ------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR [cass_3112-worker-623] c.d.d.c.RequestHandler - Unexpected error while querying /10.100.200.126
com.datastax.driver.core.exceptions.ConnectionException: [/10.100.200.126:9042] Pool is shutdown
at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:613)
at com.datastax.driver.core.SessionManager.removePool(SessionManager.java:400)
at com.datastax.driver.core.SessionManager.onDown(SessionManager.java:485)
at com.datastax.driver.core.Cluster$Manager.onDown(Cluster.java:1941)
at com.datastax.driver.core.Cluster$Manager.access$1200(Cluster.java:1385)
at com.datastax.driver.core.Cluster$Manager$5.runMayThrow(Cluster.java:1898)
at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
---------- A couple of messages that appeared later in the logs, before the Cassandra process crashed ----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Adding new JMX Proxy for host 10.100.200.126:7199
ERROR [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Failed creating a new JMX connection to 10.100.200.126:7199
java.lang.RuntimeException: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:233)
at io.cassandrareaper.jmx.JmxConnectionFactory.connectImpl(JmxConnectionFactory.java:108)
at io.cassandrareaper.jmx.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:142)
at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:233)
at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:144)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.cassandrareaper.ReaperException: Failure when establishing JMX connection to 10.100.200.126:7199
at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:255)
at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:161)
at io.cassandrareaper.jmx.JmxConnectionFactory$JmxConnectionProvider.apply(JmxConnectionFactory.java:227)
... 16 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at io.cassandrareaper.jmx.JmxProxyImpl.connectWithTimeout(JmxProxyImpl.java:273)
at io.cassandrareaper.jmx.JmxProxyImpl.connect(JmxProxyImpl.java:214)
... 18 common frames omitted
INFO [cass_3112:a74f97e0-0d1f-11ea-a32c-db1a136dd27e:a7500d18-0d1f-11ea-a32c-db1a136dd27e] i.c.j.JmxConnectionFactory - Unreachable host: Failure when establishing JMX connection to 10.100.200.126:7199: null
Thank you for any hint that may help me identify the root cause and a possible fix or workaround.