Feedback on OpenTSDB (2.0) from our testing

Chi Hoang

May 6, 2013, 2:26:48 PM
to open...@googlegroups.com
Hi,
First, I'd like to thank the developers for a great tool, and I really appreciate the community involvement throughout the process.

My experience building a test environment was mostly positive, but we ultimately encountered a few gaps that prevented us from making a definitive commitment to OpenTSDB.  The issues we encountered fall into two categories: performance and durability.

I ran these benchmarks on the next branch (OpenTSDB 2.0) with HEAD at f2f3b69a39e73fa2647b4b6ccf892e9577c80f58.

Performance

The biggest problem we had with performance is the absence of pre-aggregation.  I've read through the discussion on this feature, which is on the fringe of the road-map, and am hopeful for its inclusion at some point.  Our use case is much simpler and is covered in the road-map.  We need multiple grains of time so we can query further back in time.  Our general case is to provide data for the last 24 hours, but our users also need to be able to look at metrics year-over-year.  For us, that may mean going from capturing metrics at second precision, to minute precision, to 5-minute precision, and so on.

The other aspect of performance I had issues with was the inconsistent write rate, and it may be completely due to HBase restrictions.  I had a 3-node HBase cluster that ran on the same machines as the HDFS cluster, so there might be some configuration tuning that I need.  I read that the recommended cluster size was somewhere around 10 nodes, but I went with 3-nodes because we were trying to test the boundaries.  I was able to get extremely good burst performance with this configuration.  Using two TSDs, I was able to burst writes at 2.2mm dps/s, and it was bursting at around 1.3mm dps/s with just a single TSD. This rate could not be sustained for long, and it often resulted in symptoms that others on this group also encountered.  I run parallel workers to create my load, and sometimes TSD gets into a mode where it just single threads at 100% for a period before running again in parallel threads.  I saw this very frequently after large bursts, and the rate dropped to under 10k dps/s.  As I mentioned, I suspected it was due to HBase configuration because I tried various max region filesizes, and this happened most with larger sizes (5GB).  I got the best throughput with a 256MB max region filesize, but that caused too many regions to be created.

Durability

As I ran my performance tests, I encountered some erratic behavior when I tried to create sustained load.  For instance, writing 600k dps/s for six minutes.  I often saw that TSD could not keep up after a while and began single threading (see my note about performance above).  When I first encountered this scenario, I thought it was because TSD was wedged, so I had to restart, which resulted in my client dropping off.  The client code is a problem on our end, and using tcollector would likely protect us from this scenario of restarting TSD.  The other scenario was when we had to restart HBase, and TSD could not re-establish a connection.  This is the bigger problem as I can see us relying on this in the future.  I had expected TSD to buffer and send metrics after re-establishing connectivity.

Thanks again for helping me along my testing.

Chi

tsuna

May 7, 2013, 1:53:42 PM
to Chi Hoang, open...@googlegroups.com
Hi Chi,
thanks for your feedback.

On Mon, May 6, 2013 at 11:26 AM, Chi Hoang <chi....@gmail.com> wrote:
> I ran these benchmarks on the next branch (OpenTSDB 2.0) with HEAD at

Did you actually test anything 2.0 specific?

> The biggest problem we had with performance is the absence of
> pre-aggregation. I've read through the discussion on this feature which is
> on the fringe of the road-map, and am hopeful for its inclusion at some
> point. Our use case is much simpler and is covered in the road-map. We
> need multiple grains of time, so we can query further back in time. Our
> general case is to provide data for the last 24 hours, but our users also
> need to be able to look at metrics year-over-year. For us, that may mean
> going from capturing metrics at second precision, to minutes, to 5-minutes,
> and so on.

Yes, that's understandable. Right now you'd have to do the
pre-aggregation either separately (e.g. as a process that reads the
fine-grained data and stores back a 5-minute average under a separate
metric name) or within your collector (simpler I think).
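
Roughly along these lines, if you go the in-collector route (just an
untested sketch; the metric name, tag, and the PrintWriter plumbing to
the TSD socket are placeholders for illustration):

// Untested sketch of an in-collector 5-minute rollup.  It keeps a running
// sum/count per window and, when the window rolls over, emits the average
// under a separate metric name via the regular telnet-style "put" command.
// (Flushing the final window on shutdown is omitted for brevity.)
import java.io.PrintWriter;

final class FiveMinuteRollup {
  private static final long WINDOW = 300;  // seconds
  private long windowStart = -1;
  private double sum = 0;
  private long count = 0;

  void addPoint(final long timestamp, final double value,
                final PrintWriter tsd) {
    final long start = timestamp - (timestamp % WINDOW);
    if (windowStart != -1 && start != windowStart) {
      // Window rolled over: write the aggregate under its own metric name.
      tsd.printf("put sys.cpu.user.5m %d %f host=web01%n",
                 windowStart, sum / count);
      sum = 0;
      count = 0;
    }
    windowStart = start;
    sum += value;
    count++;
  }
}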

> The other aspect of performance I had issues with was the inconsistent write
> rate, and it may be completely due to HBase restrictions. I had a 3-node
> HBase cluster that ran on the same machines as the HDFS cluster, so there
> might be some configuration tuning that I need.

Running HBase on the same nodes as HDFS is the recommended deployment.
This way HBase is able to write and read data locally, to a certain
extent. Some have also enjoyed nice performance speedups by turning
on the "HDFS read short circuit" documented here:
http://hbase.apache.org/book/perf.hdfs.html  But this is not a
requirement for good performance, just something for when you want to
squeeze the last few drops of performance out of an already well-tuned
setup.
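
If you do want to try it, it boils down to a client-side HDFS property
in hbase-site.xml, something like the following (the exact set of
properties depends on your Hadoop version, and some versions also need
datanode-side settings, so go by the page above rather than this
snippet):

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>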

> I read that the recommended
> cluster size was somewhere around 10 nodes, but I went with 3-nodes because
> we were trying to test the boundaries.

Yeah it's just that often 3 nodes don't perform much better than 1.
So on the low end, HBase doesn't "down-scale" linearly. Things get
better past 5-10 nodes.

> This rate could not be sustained for long, and it often
> resulted in symptoms that others on this group also encountered.

Can you please outline which symptoms you observed, other than the
single-thread at 100% you mentioned below?

> parallel workers to create my load, and sometimes TSD gets into a mode where
> it just single threads at 100% for a period before running again in parallel
> threads.

This is not supposed to happen. Sounds like a possible GC issue.
Have you looked at the GC log? Were you somehow running TSD out of
memory?

> As I mentioned, I suspected it was due to HBase
> configuration because I tried various max region filesizes, and this
> happened most with larger sizes (5GB). I got the best throughput with a
> 256MB max region filesize, but that caused too many regions to be created.

I'm not sure why the region size would cause TSD to go mental and get
~stuck in a 100% CPU loop. It also seems counterintuitive that you
got better results with smaller region sizes. This is the opposite of
what we generally observe.

If you want to do high-throughput write tests with any HBase
application, you first need to pre-split a few regions. Have you done
that, or did you always start from an empty table?

> The other scenario was when we had to restart
> HBase, and TSD could not re-establish a connection.

What was happening then?

> This is the bigger problem as I can see us relying on this in the future. I had expected TSD
> to buffer and send metrics after re-establishing connectivity.

Yes it's supposed to do that.

--
Benoit "tsuna" Sigoure

Chi Hoang

May 8, 2013, 4:50:02 PM
to open...@googlegroups.com, Chi Hoang


On Tuesday, May 7, 2013 10:53:42 AM UTC-7, tsuna wrote:
Did you actually test anything 2.0 specific?


I was mostly interested in the api/query endpoint for the JSON response and decoupling of gnuplot.  We also plan on using the metadata features, and have the tracking enabled.  Now that I think of it, I read somewhere that the current metadata code has locking, and perhaps that is contributing to my performance problems.
 

Yes, that's understandable.  Right now you'd have to do the
pre-aggregation either separately (e.g. as a process that reads the
fine-grained data and stores back a 5-minute average under a separate
metric name) or within your collector (simpler I think).

We already have concerns over the limited metric namespace, so using it to capture the different levels of aggregation will put additional pressure on that front.  We also want to keep the collector lean to limit the overhead on the monitored system.  Regardless, at the current state, we'd have to invest in some R&D to add this feature, so it will factor into the decision at the end.
 

Can you please outline which symptoms you observed, other than the
single-thread at 100% you mentioned below?

I'm not sure if it is related to server connectivity, but I've gotten messages in the log indicating that it could not retrieve logs, and another message where the string "WTF" was involved :)  As I mentioned above, the metadata feature might be contributing to some of these inconsistencies.  For instance, on the base configuration, I could run 40 worker processes that log 100k data points each, and get a 250k dps/s throughput, then run another iteration, and get half the throughput, and yet another iteration where the throughput shrinks to under 10k dps/s - this is the scenario where TSD maxes on a single core.
 
> parallel workers to create my load, and sometimes TSD gets into a mode where
> it just single threads at 100% for a period before running again in parallel
> threads.

This is not supposed to happen.  Sounds like a possible GC issue.
Have you looked at the GC log?  Were you somehow running TSD out of
memory?

I attached jconsole to TSD but didn't see anything with GC.  Could it be the metadata locks?
 

I'm not sure why the region size would cause TSD to go mental and get
~stuck in a 100% CPU loop.  It also seems counterintuitive that you
got better results with smaller region sizes.  This is the opposite of
what we generally observe.

If you want to do high-throughput write tests with any HBase
application, you first need to pre-split a few regions.  Have you done
that, or did you always start from an empty table?

Are there recommendations on pre-splitting?  I asked around with our HBase devs, and they also recommended pre-splitting, but I wasn't sure what split boundaries to use.
 
> The other scenario was when we had to restart
> HBase, and TSD could not re-establish a connection.

What was happening then?

I got a combination of messages indicating it could not lock, and another that started out with "WTF?".  Here are some stack traces that I grabbed from our logs:

2013-05-03 18:47:28,803 ERROR [New I/O worker #35] ConnectionManager: Unexpected exception from downstream for [id: 0x5278781f, ...
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_24]
...
    at net.opentsdb.tsd.PutDataPointRpc.execute(PutDataPointRpc.java:74) ~[tsdb-2.0.0.jar:c4434d8]
    at net.opentsdb.tsd.RpcHandler.handleTelnetRpc(RpcHandler.java:162) ~[tsdb-2.0.0.jar:c4434d8]
    at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:132) ~[tsdb-2.0.0.jar:c4434d8]
...

2013-05-03 19:17:11,028 ERROR [New I/O worker #52] RegionClient: Unexpected exception from downstream on [id: 0xf7a3e24d, ...
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_24]
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) ~[na:1.6.0_24]
...

A lot of these, but it seems these might be normal:
2013-05-04 01:02:31,057 INFO  [New I/O worker #51] HBaseClient: There are now 9000 RPCs pending

2013-05-04 12:49:35,963 ERROR [New I/O worker #52] RegionClient: Unexpected exception from downstream on ...
java.nio.channels.ClosedChannelException: null
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:409) ~[netty-3.6.2.Final.jar:na]
...

2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF?  Trying to add AtomicIncrementRequest(table="tsdb-uid",
...

org.hbase.async.NonRecoverableException: Invalid rpcid: 1062670 found in ReplayingDecoderBuffer(ridx=15, widx=3728)
...

I may have missed a few others.

Thanks again.  I'm working on testing some other options in the meantime.

Chi

Jonathan Creasy

May 8, 2013, 5:01:55 PM
to Chi Hoang, open...@googlegroups.com

I can send you the table creation script I use to pre-split. I should be able to do that in an hour or so.

ManOLamancha

May 8, 2013, 5:20:06 PM
to open...@googlegroups.com, Chi Hoang
On Wednesday, May 8, 2013 4:50:02 PM UTC-4, Chi Hoang wrote:
I was mostly interested in the api/query endpoint for the JSON response and decoupling of gnuplot.  We also plan on using the metadata features, and have the tracking enabled.  Now that I think of it, I read somewhere that the current metadata code has locking, and perhaps that is contributing to my performance problems.

The meta should only lock (with the current code) when a new time series or metric/tag name/tag value is encountered. If you already have IDs assigned to all of your tags and you're just pushing new data points for existing timeseries, then the metadata shouldn't be much of a factor. But it wouldn't hurt to try turning off the metadata tracking and see if that improves throughput.
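
Something like this in opentsdb.conf should turn it off (I'm going from
memory on the property names for the current next branch, so please
double-check them against the configuration docs for your build):

tsd.core.meta.enable_tsuid_tracking = false
tsd.core.meta.enable_realtime_ts = false
tsd.core.meta.enable_realtime_uid = false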

We already have concerns over the limited metric namespace, so using it to capture the different levels of aggregation will put additional pressure on that front.  We also want to keep the collector lean to limit the overhead on the monitored system.  Regardless, at the current state, we'd have to invest in some R&D to add this feature, so it will factor into the decision at the end.

RE: the metric namespace, do you mean the 16M unique limit? You can increase that to 4 bytes to get 4.2B unique metrics if necessary.

A lot of these, but it seems these might be normal:
2013-05-04 01:02:31,057 INFO  [New I/O worker #51] HBaseClient: There are now 9000 RPCs pending

Yep, those are normal and happen when HBase is splitting a region. The TSD buffers the RPCs until the new region is available.

Also, are you writing data through the Telnet interface? Have you tried writing through the HTTP API to see if that makes any difference?
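
For reference, the HTTP route is a JSON body POSTed to the /api/put
endpoint, along these lines (the metric, timestamp and tags below are
just made-up examples; check the endpoint in your build against the
docs):

POST /api/put
{
  "metric": "sys.cpu.user",
  "timestamp": 1367884800,
  "value": 42,
  "tags": { "host": "web01" }
}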

Tsuna will have to comment on the others :) Thanks!

tsuna

May 8, 2013, 6:16:43 PM
to Chi Hoang, open...@googlegroups.com
On Wed, May 8, 2013 at 1:50 PM, Chi Hoang <chi....@gmail.com> wrote:
> I was mostly interested in the api/query endpoint for the JSON response and
> decoupling of gnuplot. We also plan on using the metadata features, and
> have the tracking enabled. Now that I think of it, I read somewhere that
> the current metadata code has locking, and perhaps that is contributing to
> my performance problems.

The "next" branch is the bleeding edge, so yes there may be
performance issues there. We recently devised a lock-less way of
updating time series metadata, but I'm not sure it's in the "next"
branch upstream yet.

> We already have concerns over the limited metric namespace

In Const.java, you can just switch everything to be on 4 bytes as
ManOLamancha suggested, to give yourself some breathing room. Just
bear in mind the reason this value is a compile-time constant is that
you cannot change it easily once you start using it. It would require
re-writing the entire table.
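
From memory, the relevant constants in Const.java look like this (verify
against your checkout); bump them to 4 before you write any data:

  /** Number of bytes on which a metric ID is encoded. */
  public static final short METRICS_WIDTH = 3;   // 3 => 16M metrics, 4 => 4.2B

  /** Number of bytes on which a tag name ID is encoded. */
  public static final short TAG_NAME_WIDTH = 3;

  /** Number of bytes on which a tag value ID is encoded. */
  public static final short TAG_VALUE_WIDTH = 3;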

> I'm not sure if it is related to server connectivity, but I've gotten
> messages in the log indicating that it could not retrieve logs, and another
> message where the string "WTF" was involved :)

I'm not sure what message you're referring to when you're saying "it
could not retrieve logs". Was this an HBase message?

The messages that say "WTF" are surely from asynchbase or OpenTSDB.
They indicate something really unexpected happened, and most of them
should be treated as bugs. If you can provide more details
surrounding each WTF message, that would be helpful.

> I attached jconsole to TSD but didn't see anything with GC. Could it be the
> metadata locks?

No, I can't really see how any sort of lock could cause TSD to end up
in a state where a single thread is spinning. If it wasn't GC, then
what was it? Have you captured a few stack traces (with jstack -l)
while the problem was occurring? It could help us understand what's
going on there.

> Are there recommendations on pre-splitting? I asked around with our HBase
> devs, and they also recommended pre-splitting, but I wasn't sure what split
> boundaries to use.

Assuming you keep the UIDs on 3 bytes, then you want to pre-split
based on that. You just create a bunch of regions with start/stop
keys that are 3 bytes, for reasonable ranges. For example you could
set your region boundaries to [ "\x00\x00\x00", "\x00\x00\x7F",
"\x00\x01\x00", "\x00\x01\x7F", "\x00\x02\x00", "\x00\x02\x7F", … ]
and this would put about 128 metrics per region.
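
If you'd rather script the pre-split than type the keys by hand, here is
a rough, untested sketch against the plain HBase Java client that creates
the 'tsdb' data table with boundaries following that pattern.  It skips
the compression and other table settings the stock create_table.sh
applies, and you'd adjust the number of prefixes to roughly how many
metrics you expect:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public final class PresplitTsdb {
  public static void main(final String[] args) throws Exception {
    // Two regions per 256 metric UIDs: split at [0, prefix, 0x00] and
    // [0, prefix, 0x7F] for each value of the middle byte.
    final int prefixes = 16;  // covers metric UIDs 0x000000 .. 0x000FFF
    final byte[][] splits = new byte[prefixes * 2][];
    for (int i = 0; i < prefixes; i++) {
      splits[2 * i]     = new byte[] { 0, (byte) i, 0x00 };
      splits[2 * i + 1] = new byte[] { 0, (byte) i, 0x7F };
    }
    final HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    final HTableDescriptor tsdb = new HTableDescriptor("tsdb");
    tsdb.addFamily(new HColumnDescriptor("t"));  // the data table's only family
    admin.createTable(tsdb, splits);
    admin.close();
  }
}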

> I got a combination of messages indicating it could not lock, and another
> that started out with "WTF?". Here are some stack traces that I grabbed
> from our logs:
>
> 2013-05-03 18:47:28,803 ERROR [New I/O worker #35] ConnectionManager:
> Unexpected exception from downstream for [id: 0x5278781f, ...
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_24]
> ...
> at net.opentsdb.tsd.PutDataPointRpc.execute(PutDataPointRpc.java:74)
> ~[tsdb-2.0.0.jar:c4434d8]
> at net.opentsdb.tsd.RpcHandler.handleTelnetRpc(RpcHandler.java:162)
> ~[tsdb-2.0.0.jar:c4434d8]
> at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:132)
> ~[tsdb-2.0.0.jar:c4434d8]
> ...

You elided some important parts of the stack trace, but this seems to
indicate that TSD was trying to write an error message back to the
client, and that the client had already disconnected the socket.

> 2013-05-03 19:17:11,028 ERROR [New I/O worker #52] RegionClient: Unexpected
> exception from downstream on [id: 0xf7a3e24d, ...
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_24]
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> ~[na:1.6.0_24]
> ...

This one comes from asynchbase and seems to indicate that HBase
unexpectedly closed a connection from TSD. This is not supposed to
happen, unless maybe your HBase RegionServers are crashing. Was there
any crash or anything unusual on the HBase side?

> A lot of these, but it seems these might be normal:
> 2013-05-04 01:02:31,057 INFO [New I/O worker #51] HBaseClient: There are
> now 9000 RPCs pending

These are "normal" indeed (logged at the "INFO" level), they indicate
that a region in HBase is temporarily unavailable and that TSD has to
buffer the RPCs as ManOLamancha said. You don't want this to happen
too often though, as during that times the write will pile up in TSD
and if this goes on for too long, TSD will start throwing away data
(and send an error message back to the clients).

> 2013-05-04 12:49:35,963 ERROR [New I/O worker #52] RegionClient: Unexpected
> exception from downstream on ...
> java.nio.channels.ClosedChannelException: null
> at
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:409)
> ~[netty-3.6.2.Final.jar:na]
> ...

This one appears similar to the "Connection reset by peer" above. Not
supposed to happen either unless HBase is crashing.

> 2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF? Trying
> to add AtomicIncrementRequest(table="tsdb-uid",
> ...

You truncated some important part of the error here no?

> org.hbase.async.NonRecoverableException: Invalid rpcid: 1062670 found in
> ReplayingDecoderBuffer(ridx=15, widx=3728)
> ...

This is definitely NOT supposed to happen, something really bad
happened here. This would indicate that HBase returned a response for
an RPC ID we did not expect.

--
Benoit "tsuna" Sigoure

Jonathan Creasy

May 8, 2013, 6:53:26 PM
to tsuna, Chi Hoang, open...@googlegroups.com
Here are the scripts we use at Box to create the tables for OpenTSDB:

We run create_table.rb then create_table.sh

--

Jonathan Creasy Sr. Ops Engineer

e: j...@box.com | t: 314.580.8909

Chi Hoang

May 9, 2013, 2:35:50 PM
to open...@googlegroups.com, tsuna, Chi Hoang, jcr...@box.com, j...@box.com
Thanks.  This looks like it will help with my next round of testing.  Did you mean to replace 255 with regions in the .rb?

Chi

Jonathan Creasy

May 9, 2013, 2:37:38 PM
to Chi Hoang, open...@googlegroups.com, tsuna
Yes, we always use 255 so I never noticed that. :)

Chi Hoang

May 9, 2013, 2:47:15 PM
to open...@googlegroups.com, Chi Hoang


I'm not sure what message you're referring to when you're saying "it
could not retrieve logs".  Was this an HBase message?

Sorry I meant locks.  I remember that I saw at least two different types of messages but could only find one in my current set of log files:
2013-05-03 18:56:16,780 WARN  [New I/O worker #38] UniqueId: Failed to lock the `MAXID_ROW' row
org.hbase.async.RemoteException: java.io.IOException: Timed out on getting lock for row=\x00
    at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:3239)
    at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:3206)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2627)
    at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)

    at org.hbase.async.RegionClient.deserializeException(RegionClient.java:1257) ~[asynchbase-1.4.1.jar:8c94a38]
 
The messages that say "WTF" are surely from asynchbase or OpenTSDB.
They indicate something really unexpected happened, and most of them
should be treated as bugs.  If you can provide more details
surrounding each WTF message, that would be helpful.

This type of message gets repeated many times in my log files:

2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF?  Trying to add AtomicIncrementRequest(table="tsdb-uid", key=[0, 1, 39, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 11, 0, 0, 3, 0, 0, 59], family="name", qualifier="ts_ctr", amount=1, attempt=5, region=RegionInfo(table="tsdb-uid", region_name="tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0.", stop_key=[0, 4, -33, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 28, 0, 0, 3, 0, 0, 14])) twice to NSREd RPC on "tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0."

 
> I attached jconsole to TSD but didn't see anything with GC.  Could it be the
> metadata locks?

No, I can't really see how any sort of lock could cause TSD to end up
in a state where a single thread is spinning.  If it wasn't GC, then
what was it?  Have you captured a few stack traces (with jstack -l)
while the problem was occurring?  It could help us understand what's
going on there.

I ran jstack for different periods, and the 100% single process scenario included these blocks:

New I/O worker #5" prio=10 tid=0x00002aaab4170000 nid=0x5f17 waiting for monitor entry [0x000000004371d000]
     java.lang.Thread.State: BLOCKED (on object monitor)
     at org.hbase.async.RegionClient.sendRpc(RegionClient.java:835)
     - waiting to lock <0x0000000268f9a3d0> (a org.hbase.async.RegionClient)
     at org.hbase.async.HBaseClient.sendRpcToRegion(HBaseClient.java:1501)
     at org.hbase.async.HBaseClient.put(HBaseClient.java:1289)
     at net.opentsdb.core.TSDB.addPointInternal(TSDB.java:457)
     at net.opentsdb.core.TSDB.addPoint(TSDB.java:393)
     at net.opentsdb.tsd.PutDataPointRpc.importDataPoint(PutDataPointRpc.java:246)
     at net.opentsdb.tsd.PutDataPointRpc.execute(PutDataPointRpc.java:62)
     at net.opentsdb.tsd.RpcHandler.handleTelnetRpc(RpcHandler.java:162)
     at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:132)
     at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
     at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
     at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
     at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
     at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
     at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
     at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
     at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
     at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
     at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
     at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
     at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
     at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
     at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
     at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142)
     at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
     at net.opentsdb.tsd.ConnectionManager.handleUpstream(ConnectionManager.java:75)
     at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
     at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
     at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
     at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
     at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
     at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
     at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
     at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
     at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
     at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
 



This one comes from asynchbase and seems to indicate that HBase
unexpectedly closed a connection from TSD.  This is not supposed to
happen, unless maybe your HBase RegionServers are crashing.  Was there
any crash or anything unusual on the HBase side?

I didn't see any problems with the HBase cluster.

Chi

tsuna

May 10, 2013, 2:52:00 AM
to Chi Hoang, open...@googlegroups.com, Chris Larsen
Thanks again for the additional details.

On Thu, May 9, 2013 at 11:47 AM, Chi Hoang <chi....@gmail.com> wrote:
> Sorry I meant locks. I remember that I saw at least two different types of
> messages but could only find one in my current set of log files:
> 2013-05-03 18:56:16,780 WARN [New I/O worker #38] UniqueId: Failed to lock
> the `MAXID_ROW' row
> org.hbase.async.RemoteException: java.io.IOException: Timed out on getting
> lock for row=\x00

OK so yeah this is row lock contention in HBase, and that's going to
disappear as we gradually remove all explicit row locks. It would be
interesting to determine whether this occurred because of the new
metadata tracking code, or because of the existing UniqueId code.
Maybe try applying the same workload again without metadata tracking
enabled and see if the problem goes away?

> This type of message gets repeated many times in my log files:
>
> 2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF? Trying
> to add AtomicIncrementRequest(table="tsdb-uid", key=[0, 1, 39, 0, 0, 1, 0,
> 0, 1, 0, 0, 2, 0, 0, 11, 0, 0, 3, 0, 0, 59], family="name", qualifier="ts_ct
> r", amount=1, attempt=5, region=RegionInfo(table="tsdb-uid",
> region_name="tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0.",
> stop_key=[0, 4, -33, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 28, 0, 0, 3, 0, 0,
> 14])) twice to NSREd
> RPC on "tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0."

This is not supposed to happen. Chris, can you think of a case where
we potentially send the same AtomicIncrementRequest twice? The
HBaseRpc objects can be re-used, but they cannot be used multiple
times concurrently. In other words, if you want to issue two
identical AtomicIncrementRequests at the same time, you cannot just
hand the same RPC twice to asynchbase. I'm not sure the code protects
itself against such usage, but I imagine it's one way the "WTF"
message above could occur.

> I ran jstack for different periods, and the 100% single process scenario
> included these blocks:
>
> New I/O worker #5" prio=10 tid=0x00002aaab4170000 nid=0x5f17 waiting for
> monitor entry [0x000000004371d000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.hbase.async.RegionClient.sendRpc(RegionClient.java:835)
> - waiting to lock <0x0000000268f9a3d0> (a org.hbase.async.RegionClient)
> at org.hbase.async.HBaseClient.sendRpcToRegion(HBaseClient.java:1501)

A thread that is blocked on a lock doesn't normally consume CPU
cycles. Are you saying that you saw all threads contending on that
same lock? Can you share the full output of jstack?

This lock is only acquired in one of two cases:
1. We just got disconnected from an HBase RegionServer, and we need
to retry or fail the RPCs that are being sent to this RegionServer.
2. We are just about to establish a new connection to an HBase
RegionServer, and we need to buffer the RPC until the connection is
fully established.

Both of those cases aren't expected to happen often in a "steady
state". In your earlier email, you pasted excerpts of stack traces
from asynchbase showing unexpected connection issues (e.g.
"IOException: Connection reset by peer" or "ClosedChannelException").
So the fact that you're now showing us a case where a thread is
blocked in that part of the code that specially handles
not-yet-connected or just-disconnected situations makes me think that
there are connection problems between TSD and HBase, and that you need
to understand why this is occurring, because this is not normal.

If you don't see any complaints on the HBase side, then you may have
some other network-level problem that is causing those connections to
break. They don't break without a reason. Specifically, a
"Connection reset by peer" means that the RegionServer was not running
when TSD tried to connect to it, and so the kernel of the remote
machine sent back a TCP reset.

You should probably figure out what's going on there before making any
performance measurements or drawing any conclusion, as something is
obviously not right.

--
Benoit "tsuna" Sigoure

ManOLamancha

May 10, 2013, 7:29:06 PM
to open...@googlegroups.com, Chi Hoang, Chris Larsen
On Friday, May 10, 2013 2:52:00 AM UTC-4, tsuna wrote:
> This type of message gets repeated many times in my log files:
>
> 2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF?  Trying
> to add AtomicIncrementRequest(table="tsdb-uid", key=[0, 1, 39, 0, 0, 1, 0,
> 0, 1, 0, 0, 2, 0, 0, 11, 0, 0, 3, 0, 0, 59], family="name", qualifier="ts_ct
> r", amount=1, attempt=5, region=RegionInfo(table="tsdb-uid",
> region_name="tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0.",
> stop_key=[0, 4, -33, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 28, 0, 0, 3, 0, 0,
> 14])) twice to NSREd
> RPC on "tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0."

This is not supposed to happen.  Chris, can you think of a case where
we potentially send the same AtomicIncrementRequest twice?  The
HBaseRpc objects can be re-used, but they cannot be used multiple
times concurrently.  In other words, if you want to issue two
identical AtomicIncrementRequests at the same time, you cannot just
hand the same RPC twice to asynchbase.  I'm not sure the code protects
itself against such usage, but I imagine it's one way the "WTF"
message above could occur.

Just catching up after being offline for a few days. We're creating a new AIR on each data point:

final AtomicIncrementRequest inc = new AtomicIncrementRequest(
    tsdb.uidTable(), tsuid, FAMILY, COUNTER_QUALIFIER);
tsdb.getClient().bufferAtomicIncrement(inc).addCallback(
    new TSMetaCB(tsdb, tsuid));

I could imagine it being possible on a very busy system that identical increments could be created for the same timeseries on different RPC threads, but they would still be separate requests.

tsuna

May 11, 2013, 12:31:29 AM
to ManOLamancha, open...@googlegroups.com, Chi Hoang
On Fri, May 10, 2013 at 4:29 PM, ManOLamancha <clars...@gmail.com> wrote:
> I could imagine it being possible on a very busy system that identical
> increments could be created for the same timeseries on different RPC
> threads, but they would still be separate requests.

Then that's fine. The only thing I was worried about is that one
particular instance would be used twice concurrently.

Then it's a mystery why this occurred. It's definitely a bug somewhere though.

Chi, are you seeing this often? Are you able to reproduce easily?

--
Benoit "tsuna" Sigoure

Chi Hoang

May 13, 2013, 6:34:13 PM
to open...@googlegroups.com, ManOLamancha, Chi Hoang
I rebuilt my test environment with the following changes:
- pre-split regions using Jonathan's scripts
- loading 8mm dps every minute instead of every two minutes (trying to accelerate "breakage")

It's been running for nearly 4 hours and has only had one problem: I had to restart HBase, after which TSD single-threaded, and I had to restart TSD to get the throughput back up.  The only error message was regarding a broken pipe, which you advised was likely HBase connectivity.  Since I had restarted HBase in this case, I wasn't surprised by that error.

I'm going to pause the test runs for a few hours, restart later in the evening, and let them run overnight to see if any errors come back.  So far, I have not seen the scary-looking ones.

Chi

Chi Hoang

May 14, 2013, 1:37:04 PM
to open...@googlegroups.com, ManOLamancha, Chi Hoang
I let my test loads run overnight, and TSD stopped responding after about 10 hours.  I didn't see any errors in HBase or the TSD logs, so I'm not sure what went wrong.  I restarted TSD, and things are running again.  I didn't think to take a jstack dump, so I'll let it run again, and see what happens.

Chi

Chi Hoang

May 15, 2013, 2:08:09 PM
to open...@googlegroups.com, ManOLamancha, Chi Hoang
My test loads ran for about 15 hours this time before falling over.  I've attached my TSD log file with all the INFO records (quite chatty with all the network activity logging) filtered out.  It looks like the system might have gotten busy with the initial set of errors, then snowballed to the point of failing over with the WTF messages.  I don't think this is a network problem as we were running smoothly until things fell apart.

I hope this helps.

Chi
tsd-errors.txt.gz

ManOLamancha

May 16, 2013, 9:40:22 PM
to open...@googlegroups.com, ManOLamancha, Chi Hoang
On Wednesday, May 15, 2013 2:08:09 PM UTC-4, Chi Hoang wrote:
My test loads ran for about 15 hours this time before falling over.  I've attached my TSD log file with all the INFO records (quite chatty with all the network activity logging) filtered out.  It looks like the system might have gotten busy with the initial set of errors, then snowballed to the point of failing over with the WTF messages.  I don't think this is a network problem as we were running smoothly until things fell apart

Hmm, it looks like most of these are Compaction related. I wonder if the compaction queue is backing up, Tsuna would have a better idea. You may want to try disabling compactions and see if that helps. They're enabled by default and you can disable compaction in the config file.
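
From memory that's a single flag in opentsdb.conf (double-check the name
against the configuration docs for the build you're running):

tsd.storage.enable_compaction = false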

tsuna

May 20, 2013, 3:51:44 AM
to Chi Hoang, open...@googlegroups.com, ManOLamancha
Sorry for the slow reply Chi.

On Wed, May 15, 2013 at 11:08 AM, Chi Hoang <chi....@gmail.com> wrote:
> My test loads ran for about 15 hours this time before falling over. I've
> attached my TSD log file with all the INFO records (quite chatty with all
> the network activity logging) filtered out. It looks like the system might
> have gotten busy with the initial set of errors, then snowballed to the
> point of failing over with the WTF messages. I don't think this is a
> network problem as we were running smoothly until things fell apart.
>
> I hope this helps.

It does.

The first error logged in your log file is:

2013-05-15 08:34:34,662 ERROR [New I/O worker #51] RegionClient:
Unexpected exception from downstream on [id: 0xe36f69cb,
/yy.yy.yy.yy:42587 => /10.20.78.53:60020]
java.io.IOException: Connection reset by peer
2013-05-15 08:34:34,713 ERROR [New I/O worker #51] RegionClient: After
decoding the last message on [id: 0xe36f69cb, /yy.yy.yy.yy:42587 :>
/10.20.78.53:60020], there was still some undecoded bytes in the
channel's buffer (which are going
to be lost): ReplayingDecoderBuffer(ridx=13, widx=3510)=[…]

This indicates that the RegionServer at 10.20.78.53 brutally
terminated the TCP connection between TSD and itself, most probably
because the RegionServer crashed. Can you confirm the RegionServer
crashed at that time?

There is then a whole bunch of "java.io.IOException: Broken pipe" for
sockets that are like "[id: 0xbb3ff31e, /xx.xx.xx.xx:35365 =>
/yy.yy.yy.yy:4242]" – these are obviously client sockets, between TSD
clients and TSD. Why those would get disconnected right after an
HBase node seemingly crashed is a mystery to me. Having an HBase node
crash is one thing, and suddenly seeing connection issues to all sorts
of TSD clients is another. The two occurring simultaneously can't be
a coincidence. Are you really absolutely certain you're not having
network problems of some sort?

Subsequent stack traces show that asynchbase is trying to retry some
RPCs but failing to, and the stack traces are quite long, indicating
that it may be looping for a bit before giving up. This could maybe
account for the single threading you're sometimes observing. I'm
planning to add a per-RPC "fail fast" knob in asynchbase, it would
probably help avoid this problem here.

Your log ends with an "OutOfMemoryError: GC overhead limit
exceeded", though, and that is a much more likely cause of
single-threading for a while, as the GC falls back to a single-threaded
stop-the-world collection.

--
Benoit "tsuna" Sigoure