Reaper 0.5.0 Error in running repairs

aiman parvaiz

Mar 20, 2017, 9:04:09 PM
to TLP Apache Cassandra Reaper users
Hi all,
First of all, thanks for creating this group. I am running Reaper 0.5.0 with the Cassandra backend. I have a Cassandra 2.2.8 cluster which I am trying to set up with Reaper, but I occasionally see the following ERROR/WARN messages:

ERROR  [2017-03-21 00:47:03,761] [cache:55:40235] c.s.r.s.RepairRunner - Executing SegmentRunner failed: timeout value is negative
ERROR  [2017-03-21 00:58:57,999] [cache:54:39903] c.s.r.s.SegmentRunner - SegmentRunner already exists for segment with ID: 39903

I am also getting a WARN message:
WARN   [2017-03-21 00:58:57,999] [cache:54:39903] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 39903

I have confirmed that all the nodes are listening on port 7199 on 0.0.0.0:

tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      11852/java
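For what it's worth, here is how I scripted the reachability check across nodes. This is just a plain TCP connect test (it does not speak JMX), and the host/port map below mirrors the jmxPorts section of my config, so adjust it to real hosts:

```python
import socket

def jmx_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Mirrors the jmxPorts mapping from the config below (trimmed to two entries).
jmx_ports = {"127.0.0.1": 7100, "127.0.0.2": 7200}

for host, port in jmx_ports.items():
    state = "open" if jmx_port_open(host, port) else "closed/unreachable"
    print(f"{host}:{port} {state}")
```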

This is my Reaper config YAML:

segmentCount: 200
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
allowUnreachableNodes: false
enableDynamicSeedList: true

jmxPorts:
  127.0.0.1: 7100
  127.0.0.2: 7200
  127.0.0.3: 7300
  127.0.0.4: 7400
  127.0.0.5: 7500
  127.0.0.6: 7600
  127.0.0.7: 7700
  127.0.0.8: 7800

logging:
  level: WARN
  loggers:
    io.dropwizard: WARN
    org.eclipse.jetty: WARN
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []

cassandra:
  clusterName: "Reaper"
  contactPoints: ["1.1.1.2"]
  keyspace: reaper_db

autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT6H
  excludedKeyspaces:
    - keyspace1
    - keyspace2


I would appreciate any help here.


Thanks
Aiman

Alexander Dejanovski

Mar 21, 2017, 3:07:53 AM
to TLP Apache Cassandra Reaper users
Hi, 

does the repair still progress when you get that kind of error message?
The "Failed to connect to a coordinator node for segment..." message is misleading here and seems to be due to the fact that a SegmentRunner already exists for that segment, not because the node was actually unreachable.
The main issue seems to be the negative timeout reported in the first place, which I currently cannot make sense of.

Any chance you could share a more complete log somewhere? Having it in DEBUG mode would be helpful too.

Thanks!

Alex

aiman parvaiz

Mar 21, 2017, 3:01:12 PM
to TLP Apache Cassandra Reaper users
Hey Alex,
Looking at list-runs and then grabbing the repair status of each id shows that the running repairs are progressing; the list-runs output is below. I will be emailing a ~2 MB file, with DEBUG logging on, to your thelastpickle email address. Please let me know if that won't work for you.
Thanks

[
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T20:35:24Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T20:42:15Z",
    "estimated_time_of_arrival": null,
    "id": 47,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "no events",
    "owner": "apz",
    "pause_time": "2017-03-20T20:42:16Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-20T20:42:15Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T20:35:47Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T20:36:22Z",
    "estimated_time_of_arrival": null,
    "id": 48,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T20:42:13Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-20T20:36:22Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:33:45Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T21:34:16Z",
    "estimated_time_of_arrival": null,
    "id": 49,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T21:41:51Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T21:34:16Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:34:58Z",
    "duration": "0 seconds",
    "end_time": "2017-03-21T18:24:32Z",
    "estimated_time_of_arrival": null,
    "id": 50,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "no events",
    "owner": "apz",
    "pause_time": "2017-03-21T18:24:32Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-21T18:24:32Z",
    "state": "NOT_STARTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:38:21Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T21:38:51Z",
    "estimated_time_of_arrival": null,
    "id": 51,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T21:54:41Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T21:38:52Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:14:46Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T22:15:17Z",
    "estimated_time_of_arrival": null,
    "id": 52,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T22:19:49Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T22:15:17Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:23:19Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T22:23:50Z",
    "estimated_time_of_arrival": null,
    "id": 53,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T22:45:14Z",
    "repair_parallelism": "SEQUENTIAL",
    "segments_repaired": 24,
    "start_time": "2017-03-20T22:23:51Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:54:58Z",
    "duration": "10 hours 23 minutes 37 seconds",
    "end_time": "2017-03-21T09:19:07Z",
    "estimated_time_of_arrival": null,
    "id": 54,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "All done",
    "owner": "apz",
    "pause_time": "2017-03-20T22:55:29Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 525,
    "start_time": "2017-03-20T22:55:29Z",
    "state": "DONE",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:10:13Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:10:43Z",
    "estimated_time_of_arrival": null,
    "id": 55,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:10:43Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 93,
    "start_time": "2017-03-20T23:10:43Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:11:58Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:12:28Z",
    "estimated_time_of_arrival": null,
    "id": 56,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:12:28Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 20,
    "start_time": "2017-03-20T23:12:28Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:12:52Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:13:25Z",
    "estimated_time_of_arrival": null,
    "id": 57,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T23:13:25Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 62,
    "start_time": "2017-03-20T23:13:25Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:13:45Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:14:15Z",
    "estimated_time_of_arrival": null,
    "id": 58,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T23:14:15Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 214,
    "start_time": "2017-03-20T23:14:16Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:14:34Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:15:04Z",
    "estimated_time_of_arrival": null,
    "id": 59,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:15:04Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 229,
    "start_time": "2017-03-20T23:15:05Z",
    "state": "RUNNING",
    "total_segments": 525
  }
]
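Side note: since the full listing is long, this is the quick stdlib-only snippet I used to tally runs per state from a saved copy of the list-runs JSON (the sample below is trimmed to the relevant fields):

```python
import json
from collections import Counter

def summarize_runs(list_runs_json: str) -> Counter:
    """Count repair runs per state in spreaper list-runs JSON output."""
    runs = json.loads(list_runs_json)
    return Counter(run["state"] for run in runs)

# Trimmed sample mirroring the fields shown above.
sample = json.dumps([
    {"id": 47, "state": "ABORTED"},
    {"id": 54, "state": "DONE"},
    {"id": 55, "state": "RUNNING"},
    {"id": 56, "state": "RUNNING"},
])

print(summarize_runs(sample))  # Counter({'RUNNING': 2, 'ABORTED': 1, 'DONE': 1})
```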

Alexander Dejanovski

Mar 22, 2017, 9:38:36 AM
to TLP Apache Cassandra Reaper users
Hi,

I'm troubled by that list of repair runs.
You seem to have many concurrent runs on the same cluster for the same keyspace (unless you've obfuscated the names). As a result, they compete with each other for a repair slot, which creates a lot of postponing.
That doesn't explain why you're getting negative timeouts based on intensity, though; I'll patch this in the next release to enforce strictly positive timeouts.
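To illustrate my suspicion (this is a sketch of the idea, not Reaper's actual code): intensity throttles repairs by pausing after each segment for a duration proportional to the segment's runtime, so if a runtime ever comes out negative (e.g. timestamps taken from unsynchronized clocks), the computed pause goes negative too:

```python
def intensity_pause_ms(segment_runtime_ms: float, intensity: float) -> float:
    """Sketch of an intensity throttle: with intensity i, a segment that ran
    for t ms is followed by a pause of t * (1/i - 1) ms, so roughly a fraction
    i of wall-clock time is spent repairing."""
    return segment_runtime_ms * (1.0 / intensity - 1.0)

print(intensity_pause_ms(90_000, 0.9))  # ~10000 ms pause after a 90 s segment
print(intensity_pause_ms(-500, 0.9))    # negative runtime -> negative "timeout"
```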

Could you try to kill all repairs and run a single one at a time?
Let me know if you get fewer warnings/errors this way.
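If it helps to clear the backlog in one go, you could script the abort step against the list-runs output. A sketch (it only builds the "spreaper abort-repair <id>" commands you'd run by hand; treating PAUSED as abortable is my assumption):

```python
import json

# Assumption: states worth aborting when clearing a backlog of runs.
ABORTABLE = {"RUNNING", "NOT_STARTED", "PAUSED"}

def abort_commands(list_runs_json: str) -> list[str]:
    """Build 'spreaper abort-repair <id>' commands for every non-finished run."""
    runs = json.loads(list_runs_json)
    return [f"spreaper abort-repair {run['id']}"
            for run in runs if run["state"] in ABORTABLE]

sample = json.dumps([
    {"id": 50, "state": "NOT_STARTED"},
    {"id": 54, "state": "DONE"},
    {"id": 55, "state": "RUNNING"},
])

for cmd in abort_commands(sample):
    print(cmd)
# spreaper abort-repair 50
# spreaper abort-repair 55
```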


Thanks,

aiman parvaiz

Mar 22, 2017, 10:57:44 AM
to TLP Apache Cassandra Reaper users
Hi Alex,
Quick question: if I abort a running repair from Reaper (abort-repair <id>), does that also stop the actual repair process on the node? I am asking because I aborted every running repair, even the not-yet-started ones, but OpsCenter still shows repairs running (no one else is running repairs on this cluster, for sure). I think I was wrong to assume that aborting a repair in Reaper would also abort it on the node, and that is why you saw competing repairs: before posting here, I had aborted and restarted repairs on the same keyspace multiple times.

Thanks

aiman parvaiz

Mar 22, 2017, 11:06:55 AM
to TLP Apache Cassandra Reaper users
Things seem to be running fine after aborting all repairs and starting again (no running repairs in OpsCenter either), and I am not seeing anything in the logs (logging level set to WARN). I will keep this running and keep you posted.

Thanks

Alexander Dejanovski

Mar 23, 2017, 3:49:35 AM
to TLP Apache Cassandra Reaper users
Hi Aiman,

Good to know you're back on track :)

Aborting a job in Reaper should stop the repair job on the nodes too, but it is possible that the nodes don't acknowledge the cancellation (pretty common with incremental repair, even though that's not the case here).

Cheers,

Alex

aiman parvaiz

Mar 23, 2017, 12:57:06 PM
to TLP Apache Cassandra Reaper users
Thanks for all the support, Alex. Reaper has been running with no errors/warnings for almost a day now, and repairs are progressing too, so I think I am good here. Thanks again for the help, cheers!