Reaper 1.4.1 Slow Performance: ~ 20 Segments/Day


FH

Feb 5, 2020, 3:04:22 PM
to TLP Apache Cassandra Reaper users
Hi,
This is our first experience with Reaper. We deployed 1.4.1 to manage repairs on a Cassandra 2.2.8 cluster of 21 nodes and 12 TB of data. When a repair starts, we get about 100 segments per day, but it then slows down dramatically to around 20. Reaper's throughput metrics are in the attached screenshot.

These are the details of the current repair run:

ID a12f14b0-42e6-11ea-b317-cb547f0d5fd7
Owner auto-scheduling
Cause scheduled run (schedule id cd5ec820-1dd2-11ea-8cbe-1773a5537526)
Last event Triggered repair of segment a1338181-42e6-11ea-b317-cb547f0d5fd7 via host 10.241.208.39
Start time January 29, 2020 5:28 PM
End time
Pause time
Duration 6 days 21 hours 30 minutes 9 seconds
Segment count 697
Segments repaired 372
Intensity 0.8999999761581421
Repair parallelism DATACENTER_AWARE
Incremental repair false
Repair threads 1
Nodes
Datacenters
Blacklist
Creation time January 29, 2020 5:28 PM

The most common message I see in the Reaper log is:

INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7:a13b97dc-42e6-11ea-b317-cb547f0d5fd7] i.c.s.SegmentRunner - SegmentRunner declined to repair segment a13b97dc-42e6-11ea-b317-cb547f0d5fd7 because one of the hosts (10.xxxx) was already involved in a repair 
INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7:a13b97dc-42e6-11ea-b317-cb547f0d5fd7] i.c.s.SegmentRunner - Cannot run segment a13b97dc-42e6-11ea-b317-cb547f0d5fd7 for repair a12f14b0-42e6-11ea-b317-cb547f0d5fd7 at the moment. Will try again later 
INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7] i.c.s.RepairRunner - Running segment for range (6901533759667616261,-9216315244547356850] 
INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7] i.c.s.RepairRunner - Next segment to run : a13e2fe1-42e6-11ea-b317-cb547f0d5fd7 
INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7:a13e2fe1-42e6-11ea-b317-cb547f0d5fd7] i.c.s.SegmentRunner - It is ok to repair segment 'a13e2fe1-42e6-11ea-b317-cb547f0d5fd7' on repair run with id 'a12f14b0-42e6-11ea-b317-cb547f0d5fd7' 
INFO   [neptunecluster:a12f14b0-42e6-11ea-b317-cb547f0d5fd7:a13e2fe1-42e6-11ea-b317-cb547f0d5fd7] i.c.j.JmxProxy - Triggering repair of range (8345918423518772099,8346839897102676558] for keyspace "overlordpreprod" on host 10.241.208.125, with repair parallelism dc_parallel, in cluster with Cassandra version '2.2.8' (can use DATACENTER_AWARE 'true'), for column families: [document, tracking, config] 
INFO   [neptunecluster:a12f14b0


1. "hosts (10.xxxx) was already involved in a repair": I can confirm there are no other non-Reaper repairs running in the cluster. Why is Reaper declining these segments?
2. Is this throughput expected, and what can be done to improve it? Repair intensity and parallelism?
3. I integrated Reaper with Datadog, but I'm unable to find a metric for 'fail count'. Is there one?

I'll be more than happy to provide the Reaper log if requested.

Thanks 
Screenshot - Devops-rds-cassandra-reaper _ Datadog - Google Chrome - 2_5_2020 , 2_43_40 PM.png

FH

Feb 5, 2020, 4:05:51 PM
to TLP Apache Cassandra Reaper users
Some more of our current settings:
repairRunThreadCount: 15
repairManagerSchedulingIntervalSeconds: 10
repair threads: 4
repair intensity: 1
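
For reference, a rough sketch of how the server-side pieces of that sit in cassandra-reaper.yaml; the repairIntensity and repairParallelism key names are my assumption of the YAML equivalents, so double-check them against the config shipped with 1.4.1:

repairRunThreadCount: 15
repairManagerSchedulingIntervalSeconds: 10
repairIntensity: 1.0                  # default is 0.9
repairParallelism: DATACENTER_AWARE   # matches the run details above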

Alexander Dejanovski

Feb 6, 2020, 2:10:36 AM
to FH, TLP Apache Cassandra Reaper users
Hi,

Sometimes nodes will remain in an unclean state where they believe a repair is still running. This can lead Reaper to think the same and stop running segments that involve that node.
You can check the "nodetool tpstats" output to see if there is any trace of a repair on the node that Reaper reports as already involved in a repair. To fix this, restart Cassandra on that node.
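For example, something along these lines on the suspect node will usually show leftover repair activity; the exact pool names differ between Cassandra versions, so take the grep patterns as a rough sketch:

nodetool tpstats | grep -iE 'AntiEntropy|Repair|Validation'
nodetool netstats | grep -i repair

If tpstats still shows active or pending tasks in those pools long after the segment was canceled, that node is a good candidate for the restart.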
Could you also check the duration of each segment to see whether they are hitting the 30-minute timeout? If so, you may get a lot of canceled segments that get retried afterwards, which would explain why progress is so slow. In that case, you can either raise the timeout or split the repair into more segments.
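
To put numbers on those two options, the knobs live in cassandra-reaper.yaml and would look roughly like this (the segment setting was renamed across Reaper versions, so check the exact name against your 1.4.1 config before relying on it):

hangingRepairTimeoutMins: 60     # default is 30; raise it if segments legitimately need longer
segmentCountPerNode: 32          # more, smaller segments so each one stays under the timeout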

You can send the logs to o...@thelastpickle.com if none of the above matches your issue, and we'll take a closer look.

Cheers,

-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting



FH

Feb 7, 2020, 12:49:09 PM
to TLP Apache Cassandra Reaper users
I ran a new repair after updating the intensity to 1 and the thread count to 4. With the default settings (0.9 and 1), the ETA was 15 days. Now the repair finished in 3 days, i.e. about 200 segments per day (vs. the initial 25). I'll let the Datadog Reaper metrics speak for themselves (see attached).

On a different note, I'd be more than happy to help put together a document explaining the Reaper metrics. If you see value in that effort, please let me know.

Appreciate the support. 
Screenshot - slow performance diagnostic data (UltraRecall copy).rtf - Compatibility Mode - Word - 2_7_2020 , 12_46_25 PM.png

Reid Pinchback

Feb 7, 2020, 1:11:28 PM
to FH, TLP Apache Cassandra Reaper users

Our experience was also that intensity=1 was best. And per Alexander's comment, hitting the timeout was something we had to tune for as well.

 

It's pretty easy for the combined impact of repairs and ongoing read/write work to push a node into a lot of extra GC activity as the heap temporarily bloats. With a large enough number of nodes, we were bound to have a few problem children at any given time that simply needed a bit longer to complete. Once we realized that and adjusted the timeout, the total time to repair dropped substantially. Without that change, you still suffer the added GC load from attempting the repair, but give up before reaping (no pun intended) any benefit from the attempt.
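
If you want to confirm that on your nodes, a couple of quick checks give a rough read; the log path below is just a placeholder for wherever your install writes system.log:

nodetool gcstats
grep -i GCInspector /var/log/cassandra/system.log | tail -20

The first prints GC pause totals since the last time it was invoked, and the second surfaces the pause messages Cassandra's GCInspector logs when GC starts getting in the way of request handling.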

 

R

 


FH

Feb 7, 2020, 2:12:58 PM
to TLP Apache Cassandra Reaper users
I wanted to find a way to measure whether increasing the timeout was necessary. I assume the 'fail count' is a good indicator. I couldn't find a metric for it yet, but I can glance at it in the UI. Before the change, we were averaging a fail count of 35 per segment. Today, nearly 530 segments completed with a fail count of < 2.5.

I suppose that every time the SegmentRunner declines a repair request, the fail count goes up by 1.

Thanks 