Fwd: Opentsdb support case draft


Josh Zignego

Sep 15, 2020, 3:53:38 PM
to open...@googlegroups.com

We are running OpenTSDB version 2.2.0 with one OpenTSDB table on a MapR 5.2.2 cluster (MapR-DB being the underlying datastore). We have been experiencing significant slowness with writes to our OpenTSDB table, and we have determined that the slowness is not in our MapR HBase puts but in the OpenTSDB process itself.

Typically we see write throughputs of ~100,000 records/sec, but over the past 3 weeks those rates have fallen to only about 1,000 records/sec, a huge performance degradation.

When running a stack trace against our OpenTSDB processes, we see that many threads get stuck WAITING at this point:

 

"OpenTSDB I/O Worker #8" #17 prio=5 os_prio=0 tid=0x00007ffff042f800 nid=0x509b in Object.wait() [0x00007fffd9b0a000]

   java.lang.Thread.State: WAITING (on object monitor)

            at java.lang.Object.wait(Native Method)

            at java.lang.Object.wait(Object.java:502)

            at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1138)

            - locked <0x00000006535b77e0> (a com.stumbleupon.async.Deferred$Signal)

            at com.stumbleupon.async.Deferred.joinUninterruptibly(Deferred.java:1064)

            at net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769)

            at net.opentsdb.core.TSDB.suggestMetrics(TSDB.java:999)

            at net.opentsdb.tsd.SuggestRpc.execute(SuggestRpc.java:80)

            at net.opentsdb.tsd.RpcHandler.handleHttpQuery(RpcHandler.java:283)

            at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:134)

            at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

            at org.jboss.netty.handler.timeout.IdleStateAwareChannelUpstreamHandler.handleUpstream(IdleStateAwareChannelUpstreamHandler.java:36)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

            at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294)

            at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

            at org.jboss.netty.handler.codec.http.HttpContentEncoder.messageReceived(HttpContentEncoder.java:82)

            at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

            at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)

            at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

            at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)

            at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)

            at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)

            at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)

            at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

            at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142)

            at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)

            at net.opentsdb.tsd.ConnectionManager.handleUpstream(ConnectionManager.java:87)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

            at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)

            at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)

            at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)

            at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)

            at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)

            at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)

            at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)

            at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)

            at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)

            at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)

            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

            at java.lang.Thread.run(Thread.java:745)

 

This thread waits on net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769) for a long time; stack traces taken at later intervals show the same thread still waiting on that call to return.
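For context, that stack is an HTTP /api/suggest request blocking in joinUninterruptibly(). A rough way to observe the stall from the outside is to time the endpoint directly; the sketch below does that (the host name and metric prefixes are placeholders, not our real ones):

#!/usr/bin/env python
# Rough sketch: time a few /api/suggest calls against a TSD to see how long they block.
# "tsd-host" and the prefixes below are placeholders.
import time
import requests

TSD = "http://tsd-host:4242"

for prefix in ["sys", "proc", "app"]:
    start = time.time()
    resp = requests.get("%s/api/suggest" % TSD,
                        params={"type": "metrics", "q": prefix, "max": 25},
                        timeout=300)
    print("q=%-4s status=%d results=%d took=%.2fs"
          % (prefix, resp.status_code, len(resp.json()), time.time() - start))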

 

We have checked our OpenTSDB stats endpoint and found that we are not near our UID limits:

tsd.uid.cache-hit 1600184822 23523956 kind=metrics host=client001
tsd.uid.cache-miss 1600184822 711344 kind=metrics host=client001
tsd.uid.cache-size 1600184822 1413859 kind=metrics host=client001
tsd.uid.random-collisions 1600184822 13 kind=metrics host=client001
tsd.uid.ids-used 1600184822 0 kind=metrics host=client001
tsd.uid.ids-available 1600184822 0 kind=metrics host=client001
tsd.uid.cache-hit 1600184822 48115583 kind=tagk host=client001
tsd.uid.cache-miss 1600184822 62 kind=tagk host=client001
tsd.uid.cache-size 1600184822 102 kind=tagk host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagk host=client001
tsd.uid.ids-used 1600184822 1377 kind=tagk host=client001
tsd.uid.ids-available 1600184822 4294965918 kind=tagk host=client001
tsd.uid.cache-hit 1600184822 48107077 kind=tagv host=client001
tsd.uid.cache-miss 1600184822 7369 kind=tagv host=client001
tsd.uid.cache-size 1600184822 14710 kind=tagv host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagv host=client001
tsd.uid.ids-used 1600184822 41099 kind=tagv host=client001
tsd.uid.ids-available 1600184822 4294926196 kind=tagv host=client001

 

We have already deduced that HBase writes are not the issue: there is very little delay between HBase receiving put requests and writing to MapR-DB, a strong indicator that the delay is occurring in the OpenTSDB process itself.
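To illustrate what we mean by the delay sitting on the TSD side, a single data point can be timed through the HTTP write path, as in the sketch below (the host, metric, and tag names are placeholders; "sync" makes the call wait for the storage ack rather than return once the point is queued):

#!/usr/bin/env python
# Sketch: time one data point through a TSD's HTTP write path.
# "tsd-host" and the metric/tag names are placeholders. "?details&sync" asks the
# TSD for a per-request summary and to wait for the storage ack (2.2+), so the
# elapsed time covers the full TSD + storage round trip.
import time
import requests

point = {
    "metric": "test.write.latency",
    "timestamp": int(time.time()),
    "value": 1,
    "tags": {"host": "client001"},
}

start = time.time()
resp = requests.post("http://tsd-host:4242/api/put?details&sync", json=point, timeout=300)
print("status=%d took=%.2fs body=%s" % (resp.status_code, time.time() - start, resp.text))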

Do you have any idea what might be causing the UniqueId.suggest calls to take so long? Or any other ideas on where to look for the decreased write rates?

Josh Zignego

Sep 15, 2020, 4:07:15 PM
to OpenTSDB
We have upgraded to OpenTSDB 2.4.0, and the issue persists.

Jonathan Creasy

Sep 15, 2020, 9:59:28 PM
to Josh Zignego, OpenTSDB
Are your regions splitting often or are you having any instability in the region assignments?

I haven't gotten to actively use OpenTSDB in 5 years or so, so I apologize, but I swear I used to know what this issue is. :-/


Josh Zignego

Sep 16, 2020, 11:53:35 PM
to OpenTSDB
We run a region packing job to reclaim space in the regions after metrics get deleted from them. We saw some slowdown in writes to OpenTSDB, so we had actually stopped running our region packing for a couple of weeks, since it is known to negatively impact OpenTSDB performance. Interestingly, we just started running it again and write performance has increased, so it could be that we weren't running the region packing often enough.

Did you ever have to run region packing? I really appreciate you looking at the question!

Jonathan Creasy

Sep 17, 2020, 9:32:54 AM
to Josh Zignego, OpenTSDB
No, I never ran any region packing. One issue that I only vaguely remember is that the asynchbase client was not very resilient in the face of HBase regionserver changes. So if a regionserver went away, or a region moved, it took OpenTSDB a long time to catch on, and the OpenTSDB instance would just hang. We resolved this by running 10 query instances in Docker containers per node, putting HAProxy on each datanode, and then running HAProxy on the edge nodes. We load balanced queries across these 360 OpenTSDB instances with a limit of 2 connections per instance. We had an automated task that would restart the OpenTSDB container if a single instance failed. It was something that was going to be fixed, but I think the fix went into the 3.0 line.

I don't think this is the same issue you are having though. We did the same thing with writes in terms of load balancing. We had one write instance on each node and a proxy on the edge node for writes. We used 4243 for write traffic and 4242 for queries. 
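Very roughly, the proxy side looked like the sketch below. This is from memory, not our exact config, and the host names, the backend ports, and the server counts are only illustrative:

# HAProxy sketch: queries fan out across many small TSD instances with a low
# per-instance connection cap; writes go to one TSD instance per node.
frontend tsd_queries
    bind *:4242
    default_backend tsd_query_pool

frontend tsd_writes
    bind *:4243
    default_backend tsd_write_pool

backend tsd_query_pool
    balance leastconn
    # one line per query container; maxconn 2 caps concurrent queries per instance
    server dn01-q01 datanode01:14242 check maxconn 2
    server dn01-q02 datanode01:14243 check maxconn 2
    # ... repeated for the rest of the ~360 query instances

backend tsd_write_pool
    balance roundrobin
    server dn01-w datanode01:24242 check
    # ... one write instance per node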

I assume you aren't being bitten by the old "tcollector stops working in September 2020" bug (the 1.6-billion-second unix timestamp). That would cause a big drop in writes coming from tcollector instances.
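For the date math, 1,600,000,000 seconds after the Unix epoch lands in mid-September 2020, which is where that cutoff falls:

# Quick check of when the 1.6-billion-second mark fell.
import datetime
print(datetime.datetime.fromtimestamp(1600000000, tz=datetime.timezone.utc))
# -> 2020-09-13 12:26:40+00:00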
