We are running OpenTSDB 2.2.0 with a single OpenTSDB table on a MapR 5.2.2 cluster (MapR-DB is the underlying datastore). We have been experiencing significant slowness writing to our OpenTSDB table, and we have determined that the slowness is not in the MapR HBase puts themselves but in the OpenTSDB process.
Typically we see write throughput of roughly 100,000 records/sec, but over the past three weeks it has fallen to only about 1,000 records/sec, a huge performance degradation.
When running stack traces against our OpenTSDB processes, we frequently see threads stuck in WAITING at this point:
"OpenTSDB I/O Worker #8" #17 prio=5 os_prio=0 tid=0x00007ffff042f800 nid=0x509b in Object.wait() [0x00007fffd9b0a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1138)
- locked <0x00000006535b77e0> (a com.stumbleupon.async.Deferred$Signal)
at com.stumbleupon.async.Deferred.joinUninterruptibly(Deferred.java:1064)
at net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769)
at net.opentsdb.core.TSDB.suggestMetrics(TSDB.java:999)
at net.opentsdb.tsd.SuggestRpc.execute(SuggestRpc.java:80)
at net.opentsdb.tsd.RpcHandler.handleHttpQuery(RpcHandler.java:283)
at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:134)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.handler.timeout.IdleStateAwareChannelUpstreamHandler.handleUpstream(IdleStateAwareChannelUpstreamHandler.java:36)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.codec.http.HttpContentEncoder.messageReceived(HttpContentEncoder.java:82)
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142)
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
at net.opentsdb.tsd.ConnectionManager.handleUpstream(ConnectionManager.java:87)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This thread stays blocked in net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769) for long stretches: stack traces taken at later intervals show the same thread still waiting for that call to return.
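For reference, here is the blocking pattern we believe the trace is showing. This is only an illustrative sketch against the suasync library, not the actual OpenTSDB source: suggest() starts an asynchronous UID scan and then calls joinUninterruptibly() on the resulting Deferred, which parks the calling Netty I/O worker thread until the scan completes. scanForSuggestions below is a made-up stand-in for that scan.

// Illustrative sketch only; the real logic lives in net.opentsdb.uid.UniqueId.
import com.stumbleupon.async.Deferred;
import java.util.Arrays;
import java.util.List;

public class SuggestBlockingSketch {
  // Stand-in for the asynchronous UID scan that suggest() starts.
  static Deferred<List<String>> scanForSuggestions(String prefix) {
    return Deferred.fromResult(Arrays.asList(prefix + ".cpu", prefix + ".mem"));
  }

  public static void main(String[] args) throws Exception {
    Deferred<List<String>> d = scanForSuggestions("sys");
    // This mirrors the call at UniqueId.java:769 in the trace above. Here the
    // Deferred is already resolved, so joinUninterruptibly() returns at once;
    // in OpenTSDB the scan is still in flight, so the "OpenTSDB I/O Worker"
    // thread parks in Object.wait() until it fires. While parked, that worker
    // cannot service any other channel, so puts handled by it stall as well.
    List<String> suggestions = d.joinUninterruptibly();
    System.out.println(suggestions);
  }
}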
We have checked the OpenTSDB stats endpoint and found that we are nowhere near our limits for UIDs:
tsd.uid.cache-hit 1600184822 23523956 kind=metrics host=client001
tsd.uid.cache-miss 1600184822 711344 kind=metrics host=client001
tsd.uid.cache-size 1600184822 1413859 kind=metrics host=client001
tsd.uid.random-collisions 1600184822 13 kind=metrics host=client001
tsd.uid.ids-used 1600184822 0 kind=metrics host=client001
tsd.uid.ids-available 1600184822 0 kind=metrics host=client001
tsd.uid.cache-hit 1600184822 48115583 kind=tagk host=client001
tsd.uid.cache-miss 1600184822 62 kind=tagk host=client001
tsd.uid.cache-size 1600184822 102 kind=tagk host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagk host=client001
tsd.uid.ids-used 1600184822 1377 kind=tagk host=client001
tsd.uid.ids-available 1600184822 4294965918 kind=tagk host=client001
tsd.uid.cache-hit 1600184822 48107077 kind=tagv host=client001
tsd.uid.cache-miss 1600184822 7369 kind=tagv host=client001
tsd.uid.cache-size 1600184822 14710 kind=tagv host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagv host=client001
tsd.uid.ids-used 1600184822 41099 kind=tagv host=client001
tsd.uid.ids-available 1600184822 4294926196 kind=tagv host=client001
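(The counters above are in the plain-text format served by the TSD's legacy /stats HTTP endpoint. The small helper below is just a sketch of one way to pull only the tsd.uid.* lines; it is our own utility, not part of OpenTSDB, and the host and port are assumed placeholders for our TSD.)

// Minimal sketch: fetch the plain-text /stats page and print the UID counters.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class DumpUidStats {
  public static void main(String[] args) throws Exception {
    // Assumed TSD address; adjust host/port as needed.
    URL stats = new URL("http://client001:4242/stats");
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(stats.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("tsd.uid.")) {
          System.out.println(line);
        }
      }
    }
  }
}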
We have already ruled out HBase writes as the issue: there is very little delay between the put requests reaching HBase and the data being written to MapR-DB, which strongly indicates that the delay is occurring inside the OpenTSDB process.
Does anyone have ideas about what might be causing the UniqueId suggest calls to take so long, or other suggestions on where to look for the cause of the decreased write rates?