Reaper 0.5.0 Error in running repairs

aiman parvaiz

Mar 20, 2017, 9:04:09 PM
to TLP Apache Cassandra Reaper users
Hi all,
First of all, thanks for creating this group. I am running Reaper 0.5.0 with the Cassandra backend. I have a Cassandra 2.2.8 cluster which I am trying to set up with Reaper, but I occasionally see the following ERROR/WARN messages:

ERROR  [2017-03-21 00:47:03,761] [cache:55:40235] c.s.r.s.RepairRunner - Executing SegmentRunner failed: timeout value is negative
ERROR  [2017-03-21 00:58:57,999] [cache:54:39903] c.s.r.s.SegmentRunner - SegmentRunner already exists for segment with ID: 39903

I am also getting a WARN message:
WARN   [2017-03-21 00:58:57,999] [cache:54:39903] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 39903

I have confirmed that all the nodes are listening on port 7199 on 0.0.0.0:

tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      11852/java
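For what it's worth, here is how I scripted the reachability check across nodes. This is just a plain TCP connect test (it does not speak JMX), and the host/port map below mirrors the jmxPorts section of my config, so adjust it to real hosts:

```python
import socket

def jmx_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Mirrors the jmxPorts mapping from the config below (trimmed to two entries).
jmx_ports = {"127.0.0.1": 7100, "127.0.0.2": 7200}

for host, port in jmx_ports.items():
    state = "open" if jmx_port_open(host, port) else "closed/unreachable"
    print(f"{host}:{port} {state}")
```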

This is my Reaper config YAML:

segmentCount: 200
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
allowUnreachableNodes: false
enableDynamicSeedList: true

jmxPorts:
  127.0.0.1: 7100
  127.0.0.2: 7200
  127.0.0.3: 7300
  127.0.0.4: 7400
  127.0.0.5: 7500
  127.0.0.6: 7600
  127.0.0.7: 7700
  127.0.0.8: 7800

logging:
  level: WARN
  loggers:
    io.dropwizard: WARN
    org.eclipse.jetty: WARN
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []

cassandra:
  clusterName: "Reaper"
  contactPoints: ["1.1.1.2"]
  keyspace: reaper_db

autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT6H
  excludedKeyspaces:
    - keyspace1
    - keyspace2


I would appreciate any help here.


Thanks
Aiman

Alexander Dejanovski

Mar 21, 2017, 3:07:53 AM
to TLP Apache Cassandra Reaper users
Hi, 

does the repair still progress when you get that kind of error message?
The "Failed to connect to a coordinator node for segment..." message is misleading here and seems to be due to the fact that a SegmentRunner already exists for that segment, not because the node was actually unreachable.
The main issue seems to be the negative timeout reported in the first place, which I currently cannot make sense of.

Any chance you could share a more complete log somewhere? Having it in DEBUG mode would be helpful too.

Thanks!

Alex

aiman parvaiz

Mar 21, 2017, 3:01:12 PM
to TLP Apache Cassandra Reaper users
Hey Alex,
Looking at list-runs and then grabbing the repair status of each id shows that the running repairs are progressing; the list-runs output is below. I will be emailing a ~2 MB file, with DEBUG logging on, to your thelastpickle email address. Please let me know if that won't work for you.
Thanks

[
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T20:35:24Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T20:42:15Z",
    "estimated_time_of_arrival": null,
    "id": 47,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "no events",
    "owner": "apz",
    "pause_time": "2017-03-20T20:42:16Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-20T20:42:15Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T20:35:47Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T20:36:22Z",
    "estimated_time_of_arrival": null,
    "id": 48,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T20:42:13Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-20T20:36:22Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:33:45Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T21:34:16Z",
    "estimated_time_of_arrival": null,
    "id": 49,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T21:41:51Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T21:34:16Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:34:58Z",
    "duration": "0 seconds",
    "end_time": "2017-03-21T18:24:32Z",
    "estimated_time_of_arrival": null,
    "id": 50,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "no events",
    "owner": "apz",
    "pause_time": "2017-03-21T18:24:32Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 0,
    "start_time": "2017-03-21T18:24:32Z",
    "state": "NOT_STARTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T21:38:21Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T21:38:51Z",
    "estimated_time_of_arrival": null,
    "id": 51,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T21:54:41Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T21:38:52Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:14:46Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T22:15:17Z",
    "estimated_time_of_arrival": null,
    "id": 52,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T22:19:49Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 24,
    "start_time": "2017-03-20T22:15:17Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:23:19Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T22:23:50Z",
    "estimated_time_of_arrival": null,
    "id": 53,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T22:45:14Z",
    "repair_parallelism": "SEQUENTIAL",
    "segments_repaired": 24,
    "start_time": "2017-03-20T22:23:51Z",
    "state": "ABORTED",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T22:54:58Z",
    "duration": "10 hours 23 minutes 37 seconds",
    "end_time": "2017-03-21T09:19:07Z",
    "estimated_time_of_arrival": null,
    "id": 54,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "All done",
    "owner": "apz",
    "pause_time": "2017-03-20T22:55:29Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 525,
    "start_time": "2017-03-20T22:55:29Z",
    "state": "DONE",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:10:13Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:10:43Z",
    "estimated_time_of_arrival": null,
    "id": 55,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:10:43Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 93,
    "start_time": "2017-03-20T23:10:43Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:11:58Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:12:28Z",
    "estimated_time_of_arrival": null,
    "id": 56,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:12:28Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 20,
    "start_time": "2017-03-20T23:12:28Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:12:52Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:13:25Z",
    "estimated_time_of_arrival": null,
    "id": 57,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T23:13:25Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 62,
    "start_time": "2017-03-20T23:13:25Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:13:45Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:14:15Z",
    "estimated_time_of_arrival": null,
    "id": 58,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "owner": "apz",
    "pause_time": "2017-03-20T23:14:15Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 214,
    "start_time": "2017-03-20T23:14:16Z",
    "state": "RUNNING",
    "total_segments": 525
  },
  {
    "cause": "manual spreaper run",
    "cluster_name": "C",
    "column_families": [],
    "creation_time": "2017-03-20T23:14:34Z",
    "duration": "0 seconds",
    "end_time": "2017-03-20T23:15:04Z",
    "estimated_time_of_arrival": null,
    "id": 59,
    "incremental_repair": false,
    "intensity": 0.900,
    "keyspace_name": "A",
    "last_event": "Postponed due to affected hosts already doing repairs",
    "owner": "apz",
    "pause_time": "2017-03-20T23:15:04Z",
    "repair_parallelism": "DATACENTER_AWARE",
    "segments_repaired": 229,
    "start_time": "2017-03-20T23:15:05Z",
    "state": "RUNNING",
    "total_segments": 525
  }
]
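Side note: since the full listing is long, this is the quick stdlib-only snippet I used to tally runs per state from a saved copy of the list-runs JSON (the sample below is trimmed to the relevant fields):

```python
import json
from collections import Counter

def summarize_runs(list_runs_json: str) -> Counter:
    """Count repair runs per state in spreaper list-runs JSON output."""
    runs = json.loads(list_runs_json)
    return Counter(run["state"] for run in runs)

# Trimmed sample mirroring the fields shown above.
sample = json.dumps([
    {"id": 47, "state": "ABORTED"},
    {"id": 54, "state": "DONE"},
    {"id": 55, "state": "RUNNING"},
    {"id": 56, "state": "RUNNING"},
])

print(summarize_runs(sample))  # Counter({'RUNNING': 2, 'ABORTED': 1, 'DONE': 1})
```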

Alexander Dejanovski

Mar 22, 2017, 9:38:36 AM
to TLP Apache Cassandra Reaper users
Hi,

I'm troubled by that list of repair runs.
You seem to have many concurrent runs on the same cluster for the same keyspace (unless you've obfuscated the names). As a result, they compete with each other for a repair slot, which creates a lot of postponing.
That doesn't explain why you're getting negative timeouts based on intensity, though; I'll patch this in the next release to enforce strictly positive timeouts.
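To illustrate my suspicion (this is a sketch of the idea, not Reaper's actual code): intensity throttles repairs by pausing after each segment for a duration proportional to the segment's runtime, so if a runtime ever comes out negative (e.g. timestamps taken from unsynchronized clocks), the computed pause goes negative too:

```python
def intensity_pause_ms(segment_runtime_ms: float, intensity: float) -> float:
    """Sketch of an intensity throttle: with intensity i, a segment that ran
    for t ms is followed by a pause of t * (1/i - 1) ms, so roughly a fraction
    i of wall-clock time is spent repairing."""
    return segment_runtime_ms * (1.0 / intensity - 1.0)

print(intensity_pause_ms(90_000, 0.9))  # ~10000 ms pause after a 90 s segment
print(intensity_pause_ms(-500, 0.9))    # negative runtime -> negative "timeout"
```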

Could you try to kill all repairs and run a single one at a time?
Let me know if you get fewer warnings/errors this way.
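If it helps to clear the backlog in one go, you could script the abort step against the list-runs output. A sketch (it only builds the "spreaper abort-repair <id>" commands you'd run by hand; treating PAUSED as abortable is my assumption):

```python
import json

# Assumption: states worth aborting when clearing a backlog of runs.
ABORTABLE = {"RUNNING", "NOT_STARTED", "PAUSED"}

def abort_commands(list_runs_json: str) -> list[str]:
    """Build 'spreaper abort-repair <id>' commands for every non-finished run."""
    runs = json.loads(list_runs_json)
    return [f"spreaper abort-repair {run['id']}"
            for run in runs if run["state"] in ABORTABLE]

sample = json.dumps([
    {"id": 50, "state": "NOT_STARTED"},
    {"id": 54, "state": "DONE"},
    {"id": 55, "state": "RUNNING"},
])

for cmd in abort_commands(sample):
    print(cmd)
# spreaper abort-repair 50
# spreaper abort-repair 55
```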


Thanks,

aiman parvaiz

Mar 22, 2017, 10:57:44 AM
to TLP Apache Cassandra Reaper users
Hi Alex,
Quick question: if I abort a running repair from Reaper (abort-repair <id>), does that also stop the actual repair process on the node? I am asking because I aborted every running repair, even the not-yet-started ones, but OpsCenter still shows repairs running (no one else is running repairs on this cluster, for sure). I think I was wrong to assume that aborting a repair in Reaper would also abort it on the node, and that is why you saw competing repairs: before posting here, I had aborted and restarted repairs on the same keyspace multiple times.

Thanks

aiman parvaiz

Mar 22, 2017, 11:06:55 AM
to TLP Apache Cassandra Reaper users
Things seem to be running fine after aborting all repairs and starting again (no running repairs in OpsCenter either), and I am not seeing anything in the logs (logging level set to WARN). I will keep this running and keep you posted.

Thanks

Alexander Dejanovski

Mar 23, 2017, 3:49:35 AM
to TLP Apache Cassandra Reaper users
Hi Aiman,

Good to know you're back on track :)

Aborting a job in Reaper should stop the repair job on the nodes too, but it is possible that the nodes don't acknowledge the cancellation (pretty common with incremental repair, even though that's not the case here).

Cheers,

Alex

aiman parvaiz

Mar 23, 2017, 12:57:06 PM
to TLP Apache Cassandra Reaper users
Thanks for all the support, Alex. Reaper has been running with no errors/warnings for almost a day now, and repairs are progressing too, so I think I am good here. Thanks again for the help, cheers!