OpenTSDB slows down during high load


Anthony Caiafa

Feb 18, 2016, 3:47:05 PM
to OpenTSDB

When running OpenTSDB 2.1 (6 instances, 16GB RAM each) and publishing data as fast as we can generate it (goal: 4M datapoints/sec), we see that OpenTSDB will accept put requests FAR faster than it can flush them to HBase. This results in OpenTSDB hitting a lot of GC pauses and, in extreme cases, hanging completely.


We end up getting a lot of trace like:

2016-02-18T15:04:12.935-0500: 584.046: [Full GC (Allocation Failure)  15G->14G(16G), 36.6974305 secs]

   [Eden: 0.0B(816.0M)->0.0B(816.0M) Survivors: 0.0B->0.0B Heap: 16.0G(16.0G)->14.1G(16.0G)], [Metaspace: 17737K->17737K(1064960K)]

 [Times: user=41.64 sys=9.42, real=36.70 secs]

 

When publishing as fast as possible, we start out at 2-2.5 million datapoints per second, quickly fall to ~500k/sec, and eventually trail off to 0-5k/sec. However, if we rate limit to 1M/sec, we can sustain that without hitting these GC cycles.

 

Is this just a symptom of the asynchronous processing in OpenTSDB? Is there a setting to apply backpressure to the publishers? (A consistent 1M/sec is better than TSDB instances falling over.)


I notice that 2.2 has a synchronous call that can be made, which may be a bit overkill but would likely prevent these issues.


Are there any recommendations to alleviate this problem (aside from artificial rate limiting)? More TSD instances?

ManOLamancha

Feb 18, 2016, 11:34:19 PM
to OpenTSDB
On Thursday, February 18, 2016 at 12:47:05 PM UTC-8, Anthony Caiafa wrote:

When running opentsdb 2.1 (6 instances, 16GB RAM each) and publishing data as fast as we can make it (goal 4mil datapoints/sec) we see that opentsdb will accept put requests FAR faster than it can dish them to HBase. This results in opentsdb hitting a lot of GC pauses, and in extreme cases completely hanging.


We end up getting a lot of trace like:

2016-02-18T15:04:12.935-0500: 584.046: [Full GC (Allocation Failure)  15G->14G(16G), 36.6974305 secs]

   [Eden: 0.0B(816.0M)->0.0B(816.0M) Survivors: 0.0B->0.0B Heap: 16.0G(16.0G)->14.1G(16.0G)], [Metaspace: 17737K->17737K(1064960K)]

 [Times: user=41.64 sys=9.42, real=36.70 secs]

 

What I see is when publishing as fast as possible, we start up with 2-2.5 million datapoints per second then quickly fall down to ~500k/sec and eventually trail off to 0-5k/sec. However, if we rate limit to 1mil/sec, we can sustain this without hitting these GC cycles.

 

Is this just a symptom of the asynchronous processing in OpenTSDB? Is there a setting to apply backpressure to the publishers? (Consistent 1mil/second is better than falling over TSDB instances).


Yeah, there are a few components to this problem. The first is that the TSDs buffer the writes for a particular region server for either 1 second or until they reach 1024 RPCs. Then they fire off the multi-action RPC to HBase, and when the response comes back, a callback is executed on each individual RPC. These are generally pretty short-lived depending on how fast HBase can acknowledge the writes, usually sub-second for a typical region server.

Another problem that is more likely to affect you is the NSRE (NotServingRegionException) queue: if a region is moving, splitting, or just offline, the TSD will spool up to 10K of these RPCs PER REGION in memory and retry them periodically once the region is back online. You should be able to watch the "tsd.hbase.nsre" metric and see it climb. Also check the region servers to see if the regions are indeed splitting. If you're using 2.2, try setting "hbase.nsre.high_watermark" to a much lower number, e.g. 100 (and if you're not on 2.2 yet, it's finally released and safe to upgrade). That will prevent storing a ton of data in memory during a split. When the queue is full, the TSD will start throwing PleaseThrottleExceptions, and if you can catch those in your producers, you can implement throttling logic that will help HBase and the TSDs recover.
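The throttling logic on the producer side can be as simple as retry-with-exponential-backoff whenever the TSD signals a throttle. A minimal Python sketch (send_fn is a stand-in for whatever client actually posts the batch; none of these names are OpenTSDB API):

```python
import time

def send_with_backoff(send_fn, batch, max_retries=5, base_delay=0.1):
    """Retry a batch when the TSD signals throttling (e.g. a
    PleaseThrottleException surfacing as a rejected write).
    Returns the number of attempts used; raises if the batch
    never gets accepted."""
    delay = base_delay
    for attempt in range(max_retries):
        if send_fn(batch):           # True means the TSD accepted the batch
            return attempt + 1
        time.sleep(delay)            # back off before retrying
        delay = min(delay * 2, 5.0)  # exponential backoff, capped at 5s
    raise RuntimeError("batch dropped after %d retries" % max_retries)
```

Backing off on the producer gives HBase and the TSDs room to drain their queues instead of piling up RPCs in the TSD heap.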


I notice that 2.2 has a synchronous call that can be made, which may be a bit overkill but would likely prevent these issues.


Well, that, or if you use HTTP for writes and catch those PleaseThrottle responses, you can back off the writes a bit. Right now we don't have a really good limiting strategy, but I want to add one.
 


Are there any recommendations to alleviate this problem (aside from artificial rate limiting)? More TSD instances?


Pre-splitting and sizing your regions will help a ton with the NSREs if you see those. Otherwise you'll still run into similar issues when adding more TSDs. We're running about 10M points/sec right now with 40 TSDs and a queue size of 1. When the queue is full or HBase is having a problem, we requeue the data points to Kafka for a retry later, and we haven't had an OOM or issue with the TSDs in a long time.
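The requeue-to-Kafka pattern keeps the backlog out of the TSD client's heap entirely. A sketch of the idea (write_fn and requeue_fn are hypothetical stand-ins for an OpenTSDB write client and a Kafka producer):

```python
def write_or_requeue(batch, write_fn, requeue_fn):
    """Try to write a batch; on any failure (HBase trouble, full
    queue), push it back to Kafka for a later retry instead of
    buffering it in memory. Returns True if the write succeeded."""
    try:
        write_fn(batch)
        return True
    except Exception:
        # Broad catch is deliberate for a sketch: any write failure
        # routes the batch back to the durable queue.
        requeue_fn(batch)
        return False
```

The point of the design is that Kafka, not the TSD process, absorbs the backlog, so a slow HBase never translates into a growing Java heap.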

Hope that helps a bit!

Sean Hanson

Feb 23, 2016, 2:35:36 PM
to OpenTSDB
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing anything in the logs, and I'm not seeing any errors from HTTP or PleaseThrottle responses (unless they come back as 200 response codes?). Should we investigate dropping the low/high watermarks further?

WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region splits and compactions, so we would like to know how to handle them. We can easily add flow-control logic to our publishers; we just aren't seeing those PleaseThrottles come back from OpenTSDB.

ManOLamancha

Feb 25, 2016, 12:51:26 PM
to OpenTSDB
On Tuesday, February 23, 2016 at 11:35:36 AM UTC-8, Sean Hanson wrote:
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing any logging in the logs and I'm not seeing any error from HTTP or PleaseThrottle responses (unless they are 200 response codes?). Should we investigate dropping the low/high watermarks down further? 

Hmm, try tracking the stat "tsd.hbase.nsre" and see if it's climbing. If so then try dropping the HWM to 5 or so. And if it's still GCing like crazy, could you grab and post a jmap histo please? Then we can see what's eating up the heap.


WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region-splits and compactions, so we would like to know how to handle them. We can easily add flow control logic to our publishers, we just aren't seeing those PleaseThrottles come back from OpenTSDB.

That's a bummer that those exceptions aren't popping back. And you are writing over HTTP and getting 204s back? Another thing you *could* try would be to tune the number of data points sent per HTTP request. If you're only sending one metric per HTTP call, that would definitely create a lot of garbage. I had good luck with 50 per call, but you could try more or less.
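Batching for the HTTP interface just means grouping data points into JSON arrays before POSTing them to /api/put (the endpoint and list payload are the real OpenTSDB HTTP API; the batch size of 50 follows the suggestion above and is something to tune, not a recommendation):

```python
import json

def chunk_puts(datapoints, batch_size=50):
    """Yield JSON bodies for /api/put, each carrying at most
    batch_size data points. Each data point is a dict with the
    standard OpenTSDB fields: metric, timestamp, value, tags."""
    for i in range(0, len(datapoints), batch_size):
        yield json.dumps(datapoints[i:i + batch_size])
```

Each yielded string would be POSTed as one request body; fewer, larger requests mean less per-request overhead, but very large bodies (like the 512-point batches mentioned later in this thread) create bigger allocation bursts on the TSD side.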

Anthony Caiafa

Feb 25, 2016, 4:14:24 PM
to OpenTSDB


On Thursday, February 25, 2016 at 12:51:26 PM UTC-5, ManOLamancha wrote:
On Tuesday, February 23, 2016 at 11:35:36 AM UTC-8, Sean Hanson wrote:
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing any logging in the logs and I'm not seeing any error from HTTP or PleaseThrottle responses (unless they are 200 response codes?). Should we investigate dropping the low/high watermarks down further? 

Hmm, try tracking the stat "tsd.hbase.nsre" and see if it's climbing. If so then try dropping the HWM to 5 or so. And if it's still GCing like crazy, could you grab and post a jmap histo please? Then we can see what's eating up the heap.
   
   Sure, I will grab this and post it back.


WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region-splits and compactions, so we would like to know how to handle them. We can easily add flow control logic to our publishers, we just aren't seeing those PleaseThrottles come back from OpenTSDB.

That's a bummer that those exceptions aren't popping back. And you are writing over HTTP and getting 204's back? Another thing you *could* try would be to tune the number of data points sent per HTTP request. If you're only sending one metric per HTTP call, that would definitely create a lot of garbage. I had good luck with 50 per call but you could try more or less.

    Yeah, we are pushing about 512 per call, so that may be causing some of the issue here. We are also writing over HTTP.

  With those 40 TSDs, how many region servers are you running?

ManOLamancha

Feb 26, 2016, 1:58:28 PM
to OpenTSDB
On Thursday, February 25, 2016 at 1:14:24 PM UTC-8, Anthony Caiafa wrote:
    Yeah we are pushing about 512 per call so that may be causing some of the issue here. We are writing over http also. 

Ah cool, try playing around with that and see if it helps. It may be that smaller but more frequent posts can be cleaned up faster. 

  With those 40 tsd's how many region servers are you running?  

For our raw data table (standard TSDB format) we have 80 region servers, with each RS using around 20GB of heap and 32GB of off-heap caching.

Anthony Caiafa

Feb 26, 2016, 2:49:48 PM
to ManOLamancha, OpenTSDB
How is the off-heap caching done? Also, with 40 TSDs, do you only have a certain set of region servers running the TSD process?

ManOLamancha

Feb 29, 2016, 1:31:56 PM
to OpenTSDB, clars...@gmail.com
On Friday, February 26, 2016 at 11:49:48 AM UTC-8, Anthony Caiafa wrote:

How is the off-heap caching done? Also with 40 tsds you only have a certain set of regionservers running the tsd process? 

We're using the BucketCache http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.html for off-heap. We also have the TSDs running on VMs separate from the region servers so the RS has full access to the CPU (and we need it!) 

Anthony Caiafa

Feb 29, 2016, 1:43:24 PM
to OpenTSDB, clars...@gmail.com
Ah, makes sense. Are your TSDs behind something like HAProxy or another load-balancing method?

ManOLamancha

Feb 29, 2016, 2:02:04 PM
to OpenTSDB, clars...@gmail.com
On Monday, February 29, 2016 at 10:43:24 AM UTC-8, Anthony Caiafa wrote:
ahhh makes sense. Are your TSD's behind something like haproxy or another load balancing method? 

Well, the readers are behind a load-balancing solution, but our writers actually consume from Kafka. We have a set of Kafka partitions per TSD and hash writes to those partitions, so we get an even balance and can cache UIDs per TSD. It works pretty well.
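The key property of that scheme is that the same time series always hashes to the same partition, and therefore the same TSD, which keeps that TSD's UID cache warm. A sketch of such a partitioner in Python (the hashing scheme here is illustrative, not the one the poster uses):

```python
import hashlib

def partition_for(metric, tags, num_partitions):
    """Map a time series (metric + sorted tags) to a stable Kafka
    partition index in [0, num_partitions). Sorting the tags makes
    the key independent of tag ordering in the incoming data."""
    key = metric + "," + ",".join("%s=%s" % kv for kv in sorted(tags.items()))
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

With one consumer (TSD feeder) per partition, every series has exactly one writer, so UID lookups after the first write are cache hits.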

Anthony Caiafa

Mar 1, 2016, 4:34:54 PM
to OpenTSDB, clars...@gmail.com
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String
   7:      22135194      708326208  java.util.HashMap$Node
   8:       4426902      354159296  [Ljava.util.HashMap$Node;
   9:       4427528      212521344  java.util.HashMap
  10:         44417      177221912  [Ljava.lang.Object;
  11:       4426660      177066400  net.opentsdb.core.IncomingDataPoint
  12:       4462339      142794848  [Lcom.stumbleupon.async.Callback;
  13:       4426512      141648384  net.opentsdb.tsd.PutDataPointRpc$2PutErrback
  14:       4426527       70824432  java.util.HashMap$EntrySet
  15:        360992        8663808  java.util.ArrayList
  16:          1324        4667488  [I
  17:         35827        2292928  org.hbase.async.MultiAction
  18:         35741        1429640  org.jboss.netty.channel.DefaultChannelFuture
  19:         42281        1352992  java.util.concurrent.ConcurrentHashMap$Node
  20:         35791        1145312  org.jboss.netty.buffer.BigEndianHeapChannelBuffer
  21:         35732        1143424  org.jboss.netty.channel.DownstreamMessageEvent
  22:         35935         862440  java.util.concurrent.ConcurrentLinkedQueue$Node
  23:         35827         859848  org.hbase.async.RegionClient$1MultiActionCallback
  24:         36154         578464  java.lang.Integer
  25:         13313         532520  org.hbase.async.generated.ClientPB$ResultOrException
  26:         13313         426016  org.hbase.async.generated.ClientPB$Result
  27:            53         340144  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  28:          3096         324416  java.lang.Class
  29:          8258         264256  java.lang.StackTraceElement
  30:          6797         163128  com.google.protobuf.LiteralByteString
  31:          1695         122040  org.hbase.async.generated.ClientPB$MutationProto
  32:          3552          85248  java.util.Collections$UnmodifiableRandomAccessList
  33:          1695          81360  org.hbase.async.generated.ClientPB$MutationProto$ColumnValue$QualifierValue
  34:          1443          80808  java.util.zip.Deflater
  35:          1025          73800  ch.qos.logback.classic.spi.LoggingEvent
  36:           186          69936  java.lang.Thread
  37:          1695          67800  org.hbase.async.generated.ClientPB$Action
  38:          4202          67232  java.lang.Object
  39:          1486          59440  java.lang.ref.Finalizer
  40:          1695          54240  org.hbase.async.generated.ClientPB$MutationProto$ColumnValue
  41:          1029          49824  [Ljava.lang.StackTraceElement;
  42:          2048          49152  org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket
  43:           670          42880  org.hbase.async.generated.ClientPB$MutationProto$Builder
  44:            11          37056  [Ljava.nio.ByteBuffer;
  45:          1458          34992  java.util.zip.ZStreamRef
  46:           331          26736  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  47:           479          19160  java.lang.ref.SoftReference
  48:           553          17696  java.lang.ref.WeakReference
  49:           438          14016  java.lang.ThreadLocal$ThreadLocalMap$Entry
  50:           343          13584  [Ljava.lang.String;
  51:           192          12288  java.nio.DirectByteBuffer
  52:           162          11664  sun.nio.ch.EPollArrayWrapper
  53:           162          11664  sun.nio.ch.EPollSelectorImpl
  54:           457          10968  java.lang.Long
  55:           328          10496  java.util.concurrent.CountDownLatch$Sync
  56:           160          10240  org.jboss.netty.channel.socket.nio.NioWorker

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String
   7:      15891304      508521728  java.util.HashMap$Node
   8:       3178086      254254016  [Ljava.util.HashMap$Node;
   9:       1086355      221707176  [Ljava.lang.Object;
  10:       3178722      152578656  java.util.HashMap
  11:       3177863      127114520  net.opentsdb.core.IncomingDataPoint
  12:       3203628      102516144  [Lcom.stumbleupon.async.Callback;
  13:       3177879      101692128  net.opentsdb.tsd.PutDataPointRpc$2PutErrback
  14:        528058       80024712  [I
  15:       3177881       50846096  java.util.HashMap$EntrySet
  16:        526327       42108032  [S
  17:        259427        6226248  java.util.ArrayList
  18:        131798        5271920  org.hbase.async.NotServingRegionException
  19:        131367        5254680  org.hbase.async.PleaseThrottleException
  20:        101896        4075840  org.hbase.async.generated.ClientPB$ResultOrException
  21:        158924        3814176  java.lang.StringBuilder
  22:        101896        3260672  org.hbase.async.generated.ClientPB$Result
  23:         25743        1647552  org.hbase.async.MultiAction
  24:         25777        1031080  org.jboss.netty.channel.DefaultChannelFuture
  25:         31743        1015776  java.util.concurrent.ConcurrentHashMap$Node
  26:         25976         831232  org.jboss.netty.buffer.BigEndianHeapChannelBuffer
  27:         25767         824544  org.jboss.netty.channel.DownstreamMessageEvent
  28:         25963         623112  java.util.concurrent.ConcurrentLinkedQueue$Node
  29:         25744         617856  org.hbase.async.RegionClient$1MultiActionCallback
  30:         26585         425360  java.lang.Integer
  31:            53         340144  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  32:          3097         324512  java.lang.Class
  33:          8192         262144  java.lang.StackTraceElement
  34:          1024          73728  ch.qos.logback.classic.spi.LoggingEvent
  35:           186          69936  java.lang.Thread
  36:          4182          66912  java.lang.Object
  37:          1959          62688  org.hbase.async.generated.ClientPB$RegionActionResult
  38:          1027          49200  [Ljava.lang.StackTraceElement;
  39:          2048          49152  org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket
  40:          1963          47112  java.util.Collections$UnmodifiableRandomAccessList
  41:             9          37008  [Ljava.nio.ByteBuffer;
  42:           331          26736  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  43:           354          19824  com.google.protobuf.CodedInputStream
  44:           476          19040  java.lang.ref.SoftReference
  45:           553          17696  java.lang.ref.WeakReference
  46:           425          13600  java.lang.ThreadLocal$ThreadLocalMap$Entry
  47:           326          13176  [Ljava.lang.String;
  48:           198          12672  java.nio.DirectByteBuffer
  49:           162          11664  sun.nio.ch.EPollArrayWrapper
  50:           162          11664  sun.nio.ch.EPollSelectorImpl
  51:           328          10496  java.util.concurrent.CountDownLatch$Sync
  52:           160          10240  org.jboss.netty.channel.socket.nio.NioWorker
  53:           163           9128  java.lang.invoke.MemberName
  54:           184           8832  java.nio.HeapByteBuffer
  55:           357           8568  com.google.protobuf.LiteralByteString
  56:             4           8256  [Lorg.jboss.netty.util.HashedWheelTimer$HashedWheelBucket;
  57:           190           8208  [J
  58:           512           8192  java.util.HashSet

ManOLamancha

Mar 2, 2016, 2:00:04 PM
to OpenTSDB, clars...@gmail.com
On Tuesday, March 1, 2016 at 1:34:54 PM UTC-8, Anthony Caiafa wrote:
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String

Did you run jmap with the "live" flag, or do these include the non-live objects? Either way, it looks like you have a backlog of PutRequests (20 to 30M), so we need to tune something there.

Another thing to try is to set an in-flight limit via "hbase.region_client.inflight_limit". I forgot this isn't set in the open source branch. Try tuning that and you should start seeing the PleaseThrottles.
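Pulling the knobs discussed in this thread into one place, an opentsdb.conf fragment might look like the following. The key names are the ones mentioned above; the values are illustrative starting points from the conversation, not recommendations:

```properties
# NSRE queue watermarks (2.2+): cap RPCs buffered per region while it
# is splitting/moving, so PleaseThrottleExceptions surface to writers.
hbase.nsre.high_watermark = 100
hbase.nsre.low_watermark = 50

# Cap on unacknowledged RPCs per region client (value is a guess;
# tune until you see throttles instead of heap growth).
hbase.region_client.inflight_limit = 10000
```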

Anthony Caiafa

Mar 2, 2016, 2:04:39 PM
to ManOLamancha, OpenTSDB
On Wed, Mar 2, 2016 at 2:00 PM, ManOLamancha <clars...@gmail.com> wrote:
On Tuesday, March 1, 2016 at 1:34:54 PM UTC-8, Anthony Caiafa wrote:
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String

Did you run jmap with the "live" flag or do these include the non-live objects? Either way it looks like you do have a backlog of PutRequests (20 to 30M) so we need to tune something there. 
   I ran jmap -histo processid against the process when it filled the heap. So yeah, 20 to 30M put requests in the backlog. Is there any reason these puts are piling up so much? I am sending about 1.5M points/sec with about 100 points in each put. What do you think is backing this up so much? Right now for this benchmarking I am doing 2 TSDs per region server and 9 region servers.

Another thing to try is to set an inflight limit via "hbase.region_client.inflight_limit". I forgot this isn't set in the open source branch. Try tuning that and you should start seeing the please throttles.

OK, I will toy around with this setting a bit.

ManOLamancha

Mar 2, 2016, 3:23:49 PM
to OpenTSDB, clars...@gmail.com
On Wednesday, March 2, 2016 at 11:04:39 AM UTC-8, Anthony Caiafa wrote:

Did you run jmap with the "live" flag or do these include the non-live objects? Either way it looks like you do have a backlog of PutRequests (20 to 30M) so we need to tune something there. 
   I ran jmap -histo processid against the process when it filled the heap. So yeah 20 to 30M put requests in the backlog. Is there any reason these puts are piling up so much? I am sending about 1.5m/s with about 100 points in each put. What do you think is backing this up so much? Right now for this benchmarking i am doing 2 tsd's per region server and 9 region servers. 

Gotcha. You should also run with -histo:live just to verify the PutRequests still remain. I'm guessing that they're most likely in your in-flight queue... which I just noticed isn't reported in the open source version. I'll add that. Try tracking your region server queue size. That's likely pretty large, meaning it's holding a lot of RPCs that it hasn't responded to yet. Since the client is async in nature, it will keep sending data to the region servers (or just queue it up in Netty's socket buffer) while waiting for responses. The in-flight limit will prevent that.

Anthony Caiafa

Mar 3, 2016, 10:23:04 AM
to ManOLamancha, OpenTSDB
From taking a look at the flushQueue on HBase, it seems that at times it will spike to 300 quadrillion, which seems a bit crazy to me.

ManOLamancha

Mar 3, 2016, 1:06:03 PM
to OpenTSDB, clars...@gmail.com
On Thursday, March 3, 2016 at 7:23:04 AM UTC-8, Anthony Caiafa wrote:
From taking a look at the flushQueue on hbase it seems that at times it will spike to 300 Quadr. Which seems a bit crazy to me. 

Are you seeing that in Grafana? Make sure to set the rate and counter checkboxes and put "1" for the reset value. That should fix it.

Anthony Caiafa

Mar 3, 2016, 1:08:31 PM
to ManOLamancha, OpenTSDB
Yep, rate and counter checked and 1 for the reset value.

Anthony Caiafa

Mar 3, 2016, 1:13:15 PM
to OpenTSDB, clars...@gmail.com
Yeah, so the flushQueue seems fine.