OpenTSDB slows down during high load


Anthony Caiafa

Feb 18, 2016, 3:47:05 PM
to OpenTSDB

When running OpenTSDB 2.1 (6 instances, 16GB RAM each) and publishing data as fast as we can generate it (goal: 4M datapoints/sec), we see that OpenTSDB will accept put requests FAR faster than it can flush them to HBase. This results in OpenTSDB hitting a lot of GC pauses and, in extreme cases, hanging completely.


We end up getting a lot of trace like:

2016-02-18T15:04:12.935-0500: 584.046: [Full GC (Allocation Failure)  15G->14G(16G), 36.6974305 secs]

   [Eden: 0.0B(816.0M)->0.0B(816.0M) Survivors: 0.0B->0.0B Heap: 16.0G(16.0G)->14.1G(16.0G)], [Metaspace: 17737K->17737K(1064960K)]

 [Times: user=41.64 sys=9.42, real=36.70 secs]

 

When publishing as fast as possible, we start out at 2-2.5 million datapoints per second, quickly fall to ~500k/sec, and eventually trail off to 0-5k/sec. However, if we rate limit to 1M/sec, we can sustain that without hitting these GC cycles.

 

Is this just a symptom of the asynchronous processing in OpenTSDB? Is there a setting to apply backpressure to the publishers? (A consistent 1M/sec is better than TSDB instances falling over.)


I notice that 2.2 has a synchronous call that can be made, which may be a bit overkill but would likely prevent these issues.


Are there any recommendations to alleviate this problem (aside from artificial rate limiting)? More TSD instances?

ManOLamancha

Feb 18, 2016, 11:34:19 PM
to OpenTSDB
On Thursday, February 18, 2016 at 12:47:05 PM UTC-8, Anthony Caiafa wrote:

When running opentsdb 2.1 (6 instances, 16GB RAM each) and publishing data as fast as we can make it (goal 4mil datapoints/sec) we see that opentsdb will accept put requests FAR faster than it can dish them to HBase. This results in opentsdb hitting a lot of GC pauses, and in extreme cases completely hanging.


We end up getting a lot of trace like:

2016-02-18T15:04:12.935-0500: 584.046: [Full GC (Allocation Failure)  15G->14G(16G), 36.6974305 secs]

   [Eden: 0.0B(816.0M)->0.0B(816.0M) Survivors: 0.0B->0.0B Heap: 16.0G(16.0G)->14.1G(16.0G)], [Metaspace: 17737K->17737K(1064960K)]

 [Times: user=41.64 sys=9.42, real=36.70 secs]

 

What I see is when publishing as fast as possible, we start up with 2-2.5 million datapoints per second then quickly fall down to ~500k/sec and eventually trail off to 0-5k/sec. However, if we rate limit to 1mil/sec, we can sustain this without hitting these GC cycles.

 

Is this just a symptom of the asynchronous processing in OpenTSDB? Is there a setting to apply backpressure to the publishers? (Consistent 1mil/second is better than falling over TSDB instances).


Yeah, there are a few components to this problem. The first is that the TSDs buffer the writes for a particular region server for either 1 second or until they reach 1024 RPCs. Then they fire off the multi-action RPC to HBase, and when the response comes back, a callback is executed on each individual RPC. These are generally pretty short-lived depending on how fast HBase can acknowledge the writes, usually sub-second for a typical region server.

Another problem that is more likely to affect you is the NSRE (NotServingRegionException) queue: if a region is moving, splitting, or just offline, the TSD will spool up to 10K of these RPCs PER REGION in memory and retry them periodically once the region is back online. You should be able to watch the "tsd.hbase.nsre" metric and see it climb. Also check the region servers to see if the regions are indeed splitting. If you're using 2.2, try setting "hbase.nsre.high_watermark" to a much lower number, e.g. 100 (and if you're not on 2.2 yet, it's finally released and safe to upgrade). That will prevent storing a ton of data in memory during a split. When the queue is full, the TSD will start throwing PleaseThrottleExceptions, and if you can catch those in your producers, you can implement throttling logic that will help HBase and the TSDs recover.
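The throttling logic on the producer side can be as simple as retry-with-exponential-backoff whenever the TSD signals a throttle. A minimal Python sketch (send_fn is a stand-in for whatever client actually posts the batch; none of these names are OpenTSDB API):

```python
import time

def send_with_backoff(send_fn, batch, max_retries=5, base_delay=0.1):
    """Retry a batch when the TSD signals throttling (e.g. a
    PleaseThrottleException surfacing as a rejected write).
    Returns the number of attempts used; raises if the batch
    never gets accepted."""
    delay = base_delay
    for attempt in range(max_retries):
        if send_fn(batch):           # True means the TSD accepted the batch
            return attempt + 1
        time.sleep(delay)            # back off before retrying
        delay = min(delay * 2, 5.0)  # exponential backoff, capped at 5s
    raise RuntimeError("batch dropped after %d retries" % max_retries)
```

Backing off on the producer gives HBase and the TSDs room to drain their queues instead of piling up RPCs in the TSD heap.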


I notice that 2.2 has a synchronous call that can be made, which may be a bit overkill but would likely prevent these issues.


Well, that, or if you use HTTP for writes and catch those PleaseThrottle responses, you can back off the writes a bit. Right now we don't have a really good limiting strategy, but I want to add one.
 


Are there any recommendations to alleviate this problem (aside from artificial rate limiting)? More TSD instances?


Pre-splitting and sizing your regions will help a ton with the NSREs if you see those. Otherwise you'll still run into similar issues when adding more TSDs. We're running about 10M points/sec right now with 40 TSDs and a queue size of 1. When the queue is full or HBase is having a problem, we requeue the data points to Kafka for a retry later, and we haven't had an OOM or issue with the TSDs in a long time.
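The requeue-to-Kafka pattern keeps the backlog out of the TSD client's heap entirely. A sketch of the idea (write_fn and requeue_fn are hypothetical stand-ins for an OpenTSDB write client and a Kafka producer):

```python
def write_or_requeue(batch, write_fn, requeue_fn):
    """Try to write a batch; on any failure (HBase trouble, full
    queue), push it back to Kafka for a later retry instead of
    buffering it in memory. Returns True if the write succeeded."""
    try:
        write_fn(batch)
        return True
    except Exception:
        # Broad catch is deliberate for a sketch: any write failure
        # routes the batch back to the durable queue.
        requeue_fn(batch)
        return False
```

The point of the design is that Kafka, not the TSD process, absorbs the backlog, so a slow HBase never translates into a growing Java heap.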

Hope that helps a bit!

Sean Hanson

Feb 23, 2016, 2:35:36 PM
to OpenTSDB
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing anything in the logs, and I'm not seeing any errors from HTTP or PleaseThrottle responses (unless they come back as 200 response codes?). Should we investigate dropping the low/high watermarks further?

WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region splits and compactions, so we would like to know how to handle them. We can easily add flow-control logic to our publishers; we just aren't seeing those PleaseThrottles come back from OpenTSDB.

ManOLamancha

Feb 25, 2016, 12:51:26 PM
to OpenTSDB
On Tuesday, February 23, 2016 at 11:35:36 AM UTC-8, Sean Hanson wrote:
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing any logging in the logs and I'm not seeing any error from HTTP or PleaseThrottle responses (unless they are 200 response codes?). Should we investigate dropping the low/high watermarks down further? 

Hmm, try tracking the stat "tsd.hbase.nsre" and see if it's climbing. If so then try dropping the HWM to 5 or so. And if it's still GCing like crazy, could you grab and post a jmap histo please? Then we can see what's eating up the heap.


WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region-splits and compactions, so we would like to know how to handle them. We can easily add flow control logic to our publishers, we just aren't seeing those PleaseThrottles come back from OpenTSDB.

That's a bummer that those exceptions aren't popping back. And you are writing over HTTP and getting 204s back? Another thing you *could* try would be to tune the number of data points sent per HTTP request. If you're only sending one metric per HTTP call, that would definitely create a lot of garbage. I had good luck with 50 per call, but you could try more or less.
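Batching for the HTTP interface just means grouping data points into JSON arrays before POSTing them to /api/put (the endpoint and list payload are the real OpenTSDB HTTP API; the batch size of 50 follows the suggestion above and is something to tune, not a recommendation):

```python
import json

def chunk_puts(datapoints, batch_size=50):
    """Yield JSON bodies for /api/put, each carrying at most
    batch_size data points. Each data point is a dict with the
    standard OpenTSDB fields: metric, timestamp, value, tags."""
    for i in range(0, len(datapoints), batch_size):
        yield json.dumps(datapoints[i:i + batch_size])
```

Each yielded string would be POSTed as one request body; fewer, larger requests mean less per-request overhead, but very large bodies (like the 512-point batches mentioned later in this thread) create bigger allocation bursts on the TSD side.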

Anthony Caiafa

Feb 25, 2016, 4:14:24 PM
to OpenTSDB


On Thursday, February 25, 2016 at 12:51:26 PM UTC-5, ManOLamancha wrote:
On Tuesday, February 23, 2016 at 11:35:36 AM UTC-8, Sean Hanson wrote:
Hi, co-worker of OP. We upgraded to 2.2 and are testing that out.

We dropped the high_watermark to 100 and the low to 50. We aren't seeing any logging in the logs and I'm not seeing any error from HTTP or PleaseThrottle responses (unless they are 200 response codes?). Should we investigate dropping the low/high watermarks down further? 

Hmm, try tracking the stat "tsd.hbase.nsre" and see if it's climbing. If so then try dropping the HWM to 5 or so. And if it's still GCing like crazy, could you grab and post a jmap histo please? Then we can see what's eating up the heap.
   
   Sure, I will grab this and post it back.


WRT pre-splitting, we do that (much easier with the salting added in 2.2!). We are currently diving in and trying to reduce the compactions in HBase, but we know there will always be region-splits and compactions, so we would like to know how to handle them. We can easily add flow control logic to our publishers, we just aren't seeing those PleaseThrottles come back from OpenTSDB.

That's a bummer that those exceptions aren't popping back. And you are writing over HTTP and getting 204's back? Another thing you *could* try would be to tune the number of data points sent per HTTP request. If you're only sending one metric per HTTP call, that would definitely create a lot of garbage. I had good luck with 50 per call but you could try more or less.

    Yeah, we are pushing about 512 per call, so that may be causing some of the issue here. We are also writing over HTTP.

  With those 40 TSDs, how many region servers are you running?

ManOLamancha

Feb 26, 2016, 1:58:28 PM
to OpenTSDB
On Thursday, February 25, 2016 at 1:14:24 PM UTC-8, Anthony Caiafa wrote:
    Yeah we are pushing about 512 per call so that may be causing some of the issue here. We are writing over http also. 

Ah cool, try playing around with that and see if it helps. It may be that smaller but more frequent posts can be cleaned up faster. 

  With those 40 tsd's how many region servers are you running?  

For our raw data table (standard TSDB format) we have 80 region servers, with each RS using around 20GB of heap and 32GB of off-heap caching.

Anthony Caiafa

Feb 26, 2016, 2:49:48 PM
to ManOLamancha, OpenTSDB
How is the off-heap caching done? Also, with 40 TSDs, do you only have a certain set of region servers running the TSD process?

ManOLamancha

Feb 29, 2016, 1:31:56 PM
to OpenTSDB, clars...@gmail.com
On Friday, February 26, 2016 at 11:49:48 AM UTC-8, Anthony Caiafa wrote:

How is the off-heap caching done? Also with 40 tsds you only have a certain set of regionservers running the tsd process? 

We're using the BucketCache http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.html for off-heap. We also have the TSDs running on VMs separate from the region servers so the RS has full access to the CPU (and we need it!) 

Anthony Caiafa

Feb 29, 2016, 1:43:24 PM
to OpenTSDB, clars...@gmail.com
Ah, makes sense. Are your TSDs behind something like HAProxy or another load-balancing method?

ManOLamancha

Feb 29, 2016, 2:02:04 PM
to OpenTSDB, clars...@gmail.com
On Monday, February 29, 2016 at 10:43:24 AM UTC-8, Anthony Caiafa wrote:
ahhh makes sense. Are your TSD's behind something like haproxy or another load balancing method? 

Well, the readers are behind a load-balancing solution, but our writers actually consume from Kafka. We have a set of Kafka partitions per TSD and hash writes to those partitions, so we get an even balance and can cache UIDs per TSD. It works pretty well.
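The key property of that scheme is that the same time series always hashes to the same partition, and therefore the same TSD, which keeps that TSD's UID cache warm. A sketch of such a partitioner in Python (the hashing scheme here is illustrative, not the one the poster uses):

```python
import hashlib

def partition_for(metric, tags, num_partitions):
    """Map a time series (metric + sorted tags) to a stable Kafka
    partition index in [0, num_partitions). Sorting the tags makes
    the key independent of tag ordering in the incoming data."""
    key = metric + "," + ",".join("%s=%s" % kv for kv in sorted(tags.items()))
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

With one consumer (TSD feeder) per partition, every series has exactly one writer, so UID lookups after the first write are cache hits.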

Anthony Caiafa

Mar 1, 2016, 4:34:54 PM
to OpenTSDB, clars...@gmail.com
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String
   7:      22135194      708326208  java.util.HashMap$Node
   8:       4426902      354159296  [Ljava.util.HashMap$Node;
   9:       4427528      212521344  java.util.HashMap
  10:         44417      177221912  [Ljava.lang.Object;
  11:       4426660      177066400  net.opentsdb.core.IncomingDataPoint
  12:       4462339      142794848  [Lcom.stumbleupon.async.Callback;
  13:       4426512      141648384  net.opentsdb.tsd.PutDataPointRpc$2PutErrback
  14:       4426527       70824432  java.util.HashMap$EntrySet
  15:        360992        8663808  java.util.ArrayList
  16:          1324        4667488  [I
  17:         35827        2292928  org.hbase.async.MultiAction
  18:         35741        1429640  org.jboss.netty.channel.DefaultChannelFuture
  19:         42281        1352992  java.util.concurrent.ConcurrentHashMap$Node
  20:         35791        1145312  org.jboss.netty.buffer.BigEndianHeapChannelBuffer
  21:         35732        1143424  org.jboss.netty.channel.DownstreamMessageEvent
  22:         35935         862440  java.util.concurrent.ConcurrentLinkedQueue$Node
  23:         35827         859848  org.hbase.async.RegionClient$1MultiActionCallback
  24:         36154         578464  java.lang.Integer
  25:         13313         532520  org.hbase.async.generated.ClientPB$ResultOrException
  26:         13313         426016  org.hbase.async.generated.ClientPB$Result
  27:            53         340144  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  28:          3096         324416  java.lang.Class
  29:          8258         264256  java.lang.StackTraceElement
  30:          6797         163128  com.google.protobuf.LiteralByteString
  31:          1695         122040  org.hbase.async.generated.ClientPB$MutationProto
  32:          3552          85248  java.util.Collections$UnmodifiableRandomAccessList
  33:          1695          81360  org.hbase.async.generated.ClientPB$MutationProto$ColumnValue$QualifierValue
  34:          1443          80808  java.util.zip.Deflater
  35:          1025          73800  ch.qos.logback.classic.spi.LoggingEvent
  36:           186          69936  java.lang.Thread
  37:          1695          67800  org.hbase.async.generated.ClientPB$Action
  38:          4202          67232  java.lang.Object
  39:          1486          59440  java.lang.ref.Finalizer
  40:          1695          54240  org.hbase.async.generated.ClientPB$MutationProto$ColumnValue
  41:          1029          49824  [Ljava.lang.StackTraceElement;
  42:          2048          49152  org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket
  43:           670          42880  org.hbase.async.generated.ClientPB$MutationProto$Builder
  44:            11          37056  [Ljava.nio.ByteBuffer;
  45:          1458          34992  java.util.zip.ZStreamRef
  46:           331          26736  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  47:           479          19160  java.lang.ref.SoftReference
  48:           553          17696  java.lang.ref.WeakReference
  49:           438          14016  java.lang.ThreadLocal$ThreadLocalMap$Entry
  50:           343          13584  [Ljava.lang.String;
  51:           192          12288  java.nio.DirectByteBuffer
  52:           162          11664  sun.nio.ch.EPollArrayWrapper
  53:           162          11664  sun.nio.ch.EPollSelectorImpl
  54:           457          10968  java.lang.Long
  55:           328          10496  java.util.concurrent.CountDownLatch$Sync
  56:           160          10240  org.jboss.netty.channel.socket.nio.NioWorker

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String
   7:      15891304      508521728  java.util.HashMap$Node
   8:       3178086      254254016  [Ljava.util.HashMap$Node;
   9:       1086355      221707176  [Ljava.lang.Object;
  10:       3178722      152578656  java.util.HashMap
  11:       3177863      127114520  net.opentsdb.core.IncomingDataPoint
  12:       3203628      102516144  [Lcom.stumbleupon.async.Callback;
  13:       3177879      101692128  net.opentsdb.tsd.PutDataPointRpc$2PutErrback
  14:        528058       80024712  [I
  15:       3177881       50846096  java.util.HashMap$EntrySet
  16:        526327       42108032  [S
  17:        259427        6226248  java.util.ArrayList
  18:        131798        5271920  org.hbase.async.NotServingRegionException
  19:        131367        5254680  org.hbase.async.PleaseThrottleException
  20:        101896        4075840  org.hbase.async.generated.ClientPB$ResultOrException
  21:        158924        3814176  java.lang.StringBuilder
  22:        101896        3260672  org.hbase.async.generated.ClientPB$Result
  23:         25743        1647552  org.hbase.async.MultiAction
  24:         25777        1031080  org.jboss.netty.channel.DefaultChannelFuture
  25:         31743        1015776  java.util.concurrent.ConcurrentHashMap$Node
  26:         25976         831232  org.jboss.netty.buffer.BigEndianHeapChannelBuffer
  27:         25767         824544  org.jboss.netty.channel.DownstreamMessageEvent
  28:         25963         623112  java.util.concurrent.ConcurrentLinkedQueue$Node
  29:         25744         617856  org.hbase.async.RegionClient$1MultiActionCallback
  30:         26585         425360  java.lang.Integer
  31:            53         340144  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  32:          3097         324512  java.lang.Class
  33:          8192         262144  java.lang.StackTraceElement
  34:          1024          73728  ch.qos.logback.classic.spi.LoggingEvent
  35:           186          69936  java.lang.Thread
  36:          4182          66912  java.lang.Object
  37:          1959          62688  org.hbase.async.generated.ClientPB$RegionActionResult
  38:          1027          49200  [Ljava.lang.StackTraceElement;
  39:          2048          49152  org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket
  40:          1963          47112  java.util.Collections$UnmodifiableRandomAccessList
  41:             9          37008  [Ljava.nio.ByteBuffer;
  42:           331          26736  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
  43:           354          19824  com.google.protobuf.CodedInputStream
  44:           476          19040  java.lang.ref.SoftReference
  45:           553          17696  java.lang.ref.WeakReference
  46:           425          13600  java.lang.ThreadLocal$ThreadLocalMap$Entry
  47:           326          13176  [Ljava.lang.String;
  48:           198          12672  java.nio.DirectByteBuffer
  49:           162          11664  sun.nio.ch.EPollArrayWrapper
  50:           162          11664  sun.nio.ch.EPollSelectorImpl
  51:           328          10496  java.util.concurrent.CountDownLatch$Sync
  52:           160          10240  org.jboss.netty.channel.socket.nio.NioWorker
  53:           163           9128  java.lang.invoke.MemberName
  54:           184           8832  java.nio.HeapByteBuffer
  55:           357           8568  com.google.protobuf.LiteralByteString
  56:             4           8256  [Lorg.jboss.netty.util.HashedWheelTimer$HashedWheelBucket;
  57:           190           8208  [J
  58:           512           8192  java.util.HashSet

ManOLamancha

Mar 2, 2016, 2:00:04 PM
to OpenTSDB, clars...@gmail.com
On Tuesday, March 1, 2016 at 1:34:54 PM UTC-8, Anthony Caiafa wrote:
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String

Did you run jmap with the "live" flag, or do these include the non-live objects? Either way, it looks like you have a backlog of PutRequests (20 to 30M), so we need to tune something there.

Another thing to try is to set an in-flight limit via "hbase.region_client.inflight_limit". I forgot this isn't set in the open source branch. Try tuning that and you should start seeing the PleaseThrottles.
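Pulling the knobs discussed in this thread into one place, an opentsdb.conf fragment might look like the following. The key names are the ones mentioned above; the values are illustrative starting points from the conversation, not recommendations:

```properties
# NSRE queue watermarks (2.2+): cap RPCs buffered per region while it
# is splitting/moving, so PleaseThrottleExceptions surface to writers.
hbase.nsre.high_watermark = 100
hbase.nsre.low_watermark = 50

# Cap on unacknowledged RPCs per region client (value is a guess;
# tune until you see throttles instead of heap growth).
hbase.region_client.inflight_limit = 10000
```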

Anthony Caiafa

Mar 2, 2016, 2:04:39 PM
to ManOLamancha, OpenTSDB
On Wed, Mar 2, 2016 at 2:00 PM, ManOLamancha <clars...@gmail.com> wrote:
On Tuesday, March 1, 2016 at 1:34:54 PM UTC-8, Anthony Caiafa wrote:
Here is a small snippet:

Process 1

num     #instances         #bytes  class name
----------------------------------------------
   1:      77520146     6015881344  [B
   2:      36527144     3214388672  org.hbase.async.PutRequest
   3:      73054292     1753304576  [[B
   4:      36564608     1170067456  com.stumbleupon.async.Deferred
   5:      31000743     1095850256  [C
   6:      31001472      744035328  java.lang.String

Process 2 

 num     #instances         #bytes  class name
----------------------------------------------
   1:      55672484     4327518680  [B
   2:      26233095     2308512360  org.hbase.async.PutRequest
   3:      23078059     1271840816  [C
   4:      52466169     1259189592  [[B
   5:      26258830      840282560  com.stumbleupon.async.Deferred
   6:      22522620      540542880  java.lang.String

Did you run jmap with the "live" flag or do these include the non-live objects? Either way it looks like you do have a backlog of PutRequests (20 to 30M) so we need to tune something there. 
   I ran jmap -histo processid against the process when it filled the heap. So yeah, 20 to 30M put requests in the backlog. Is there any reason these puts are piling up so much? I am sending about 1.5M points/sec with about 100 points in each put. What do you think is backing this up so much? Right now for this benchmarking I am doing 2 TSDs per region server and 9 region servers.

Another thing to try is to set an inflight limit via "hbase.region_client.inflight_limit". I forgot this isn't set in the open source branch. Try tuning that and you should start seeing the please throttles.

OK, I will toy around with this setting a bit.

ManOLamancha

Mar 2, 2016, 3:23:49 PM
to OpenTSDB, clars...@gmail.com
On Wednesday, March 2, 2016 at 11:04:39 AM UTC-8, Anthony Caiafa wrote:

Did you run jmap with the "live" flag or do these include the non-live objects? Either way it looks like you do have a backlog of PutRequests (20 to 30M) so we need to tune something there. 
   I ran jmap -histo processid against the process when it filled the heap. So yeah 20 to 30M put requests in the backlog. Is there any reason these puts are piling up so much? I am sending about 1.5m/s with about 100 points in each put. What do you think is backing this up so much? Right now for this benchmarking i am doing 2 tsd's per region server and 9 region servers. 

Gotcha. You should also run with -histo:live just to verify the PutRequests still remain. I'm guessing that they're most likely in your in-flight queue... which I just noticed isn't reported in the open source version. I'll add that. Try tracking your region server queue size. That's likely pretty large, meaning it's holding a lot of RPCs that it hasn't responded to yet. Since the client is async in nature, it will keep sending data to the region servers (or just queue it up in Netty's socket buffer) while waiting for responses. The in-flight limit will prevent that.

Anthony Caiafa

Mar 3, 2016, 10:23:04 AM
to ManOLamancha, OpenTSDB
From taking a look at the flushQueue on HBase, it seems that at times it will spike to 300 quadrillion, which seems a bit crazy to me.

ManOLamancha

Mar 3, 2016, 1:06:03 PM
to OpenTSDB, clars...@gmail.com
On Thursday, March 3, 2016 at 7:23:04 AM UTC-8, Anthony Caiafa wrote:
From taking a look at the flushQueue on hbase it seems that at times it will spike to 300 Quadr. Which seems a bit crazy to me. 

Are you seeing that in Grafana? Make sure to set the rate and counter checkboxes and put "1" for the reset value. That should fix it.

Anthony Caiafa

Mar 3, 2016, 1:08:31 PM
to ManOLamancha, OpenTSDB
Yep, rate and counter checked and 1 for the reset value.

Anthony Caiafa

Mar 3, 2016, 1:13:15 PM
to OpenTSDB, clars...@gmail.com
Yeah, so the flushQueue seems fine.