Description of one high performance OpenTSDB cluster

Jonathan Creasy

Jul 11, 2016, 1:33:30 PM
to OpenTSDB

My most recent high-performance layout has 36 data nodes, 2 edge nodes, and 3 name nodes.

Data Nodes (36x)

  • HDFS Datanode
  • HBase Regionserver
  • TSD "rw" mode, used as "wo" (TCP 4243)
  • HAProxy (TCP 4242)
  • 10x TSD "ro" mode (TCP 8001-8010)

Edge Nodes (2x, with floating VIP)

  • Splicer Query Engine (NativeIP:4245)
  • HAProxy (FloatingIP:4242, FloatingIP:4243, FloatingIP:4245, NativeIP:4243)
  • TSD "rw" mode, not typically used (NativeIP:4242)

Write Path (always port 4243)

PUT commands were executed via the telnet protocol. TCollector was configured with the native IP addresses of both Edge nodes, and would self-balance and fail over between them across the 10,000 or so clients.

On each machine in the network, the tcp_bridge and udp_bridge collectors listened on port 4243, so all self-instrumented applications could always push their metrics to "localhost:4243".
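As an illustration of that write path, here is a minimal sketch of what a self-instrumented application might send to the local bridge over the telnet-style put protocol (the metric name, value, and tags are invented for the example):

import socket
import time

# Hypothetical datapoint, in the telnet-style put format:
#   put <metric> <unix-timestamp> <value> <tagk=tagv> ...
line = "put app.requests.count %d 42 host=web01 status=200\n" % int(time.time())

# Every host pushes to the local tcp_bridge collector on port 4243.
with socket.create_connection(("localhost", 4243)) as sock:
    sock.sendall(line.encode("ascii"))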

When the writes would arrive on the Edge Nodes, the HAProxy instance there would balance the writes across the TSD instances (4243) on each Data Node.
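For reference, a rough sketch of what the write-path section of the edge-node HAProxy config might look like (the hostnames and balancing choice are placeholders, not the actual config):

listen tsd_write
    bind *:4243
    mode tcp
    balance leastconn
    # one entry per data node's write-only TSD; datanode01..36 are made-up names
    server datanode01 datanode01:4243 check
    server datanode02 datanode02:4243 check
    # ... through datanode36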

Read Path (always 4242 for OpenTSDB and 4245 for Splicer)

For the query path, the same pair of edge nodes is used as for the writes. Queries would hit the floating IP on port 4245 for the Splicer query engine.

Splicer takes an incoming query, breaks it up into 1-hour chunks, and spreads the resulting sub-queries across the 10 Docker containers on the data node where that region lives. The 1-hour results are cached in Redis so that Grafana dashboards showing weeks and months of data will work quickly.
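A minimal sketch of the 1-hour chunking idea, in Python (this is not Splicer's actual code; the function name and cache key are invented for illustration):

HOUR = 3600

def hour_chunks(start_ts, end_ts):
    """Split a query's time range into hour-aligned (start, end) chunks."""
    chunks = []
    cursor = start_ts - (start_ts % HOUR)  # align down to the hour boundary
    while cursor < end_ts:
        chunks.append((max(cursor, start_ts), min(cursor + HOUR, end_ts)))
        cursor += HOUR
    return chunks

# Each chunk is then fanned out to a "ro" TSD on the data node holding the
# relevant region, and its result cached under a key like
# "<query-hash>:<chunk-start>"; only the most recent, partial hour misses.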

Queries could be, and sometimes were, executed against OpenTSDB directly, using port 4242 on the Edge nodes. When this happened, the HAProxy instance would balance the incoming connections, in HTTP mode, across the HAProxy instances (port 4242) on each Data Node.
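On the data-node side of the read path, the local HAProxy fan-out across the ten "ro" TSDs could look roughly like this (again illustrative, not the production config):

listen tsd_read
    bind *:4242
    mode http
    balance roundrobin
    # the 10 read-only TSD containers on this data node
    server tsd_ro_01 127.0.0.1:8001 check
    server tsd_ro_02 127.0.0.1:8002 check
    # ... through 127.0.0.1:8010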

Peter Childs

Jul 12, 2016, 1:00:48 AM
to OpenTSDB

The Splicer idea sounds very interesting. What sort of Redis backing do you use for this, and what type of hit rates do you see?

I assume you architected it in this manner to 'solve' an issue?

Each Data Node has 10x TSDs in RO? Again, was this to 'solve' an issue?

What type of volume of reads/writes required this type of scale? What type of spindle/disk deployment did you need on the data nodes (i.e. a large quantity of smaller disks, a small quantity of larger ones, or balanced)?

Given that this looks like a pretty serious deployment, what type of pain do you experience when parts of it break?

Thanks for sharing.

Jonathan Creasy

Jul 12, 2016, 2:56:50 AM
to Peter Childs, OpenTSDB
One of the main problems we needed to solve was that people wanted dashboards with 3 weeks of data, for thousands of servers reporting dozens of tag values on a metric every 15 seconds. So a query for an hour or two was trying to move 2GB of data across the network between the regionserver that had the data and the TSD node that happened to get the query.

So, we solved that by using the OpenTSDB code to determine which regionserver had a given metric so we could send the requests to the local node.

For Redis, I think each edge node has 2GB of RAM allocated and our hit rates are really high. The reason is that we have dozens of screens around the company showing a week of data for 10-12 panels with a refresh rate of 5m, so every 5 minutes it goes and queries the data. If it asks for all of it, we get *most* of it out of the cache because the only part that changes is the most recent (partial) hour.

Redis settings:

redis::conf_unixsocket: /tmp/redis.sock
redis::conf_unixsocketperm: 775
redis::conf_maxclients: 100
redis::conf_maxmemory: 34359738368
redis::conf_maxmemory_policy: allkeys-lru
redis::conf_appendfsync: no
redis::conf_appendonly: no
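
For context, a small sketch of how a query engine could talk to that Redis instance over the configured unix socket using redis-py (the key naming and helper function are hypothetical):

import json
import redis

# Connect over the unix socket from the settings above.
r = redis.Redis(unix_socket_path="/tmp/redis.sock")

def cached_chunk(key, compute):
    # Return a cached 1-hour result, computing and storing it on a miss.
    # With maxmemory-policy allkeys-lru, old chunks are evicted automatically,
    # so no explicit TTL is strictly required here.
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute()
    r.set(key, json.dumps(result))
    return result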

It's pretty stable; one part that is a little fragile right now is that Splicer doesn't handle it well when one of the backends is not reachable. This results in some queries returning a 500.

The 10 TSDs in RO mode help with the number of parallel queries we have, partly because Splicer is breaking the queries up and executing them in parallel and partly because we have a really high query volume on that cluster.

-Jonathan

Jonathan Creasy

Jul 12, 2016, 2:59:14 AM
to Peter Childs, OpenTSDB
"What type of volume of read/writes required this type of scale?   What type of spindle/disk deployment did you need on the data nodes (ie large qty of smaller, small qty of lager, balanced)"

The write volume is about 3 million datapoints per second; it has reached 10 million during recovery periods.

The data nodes have 8x spinning disks each, not even SSDs. I don't remember the exact specs. They are Supermicro nodes with 4 nodes per chassis. They might be 2TB drives or something.

casey.d...@teamaol.com

Jul 20, 2016, 12:17:55 PM
to OpenTSDB, pjch...@gmail.com
We tried using the Splicer Query Engine Jonathan is referring to here and had an issue. We are using version 2.2.0 of OpenTSDB. When using the RegionChecker class in program execution, there is a variable defining the metric width. This is defined as 4 bytes, but for us the width was 3 bytes. This caused an ArrayIndexOutOfBoundsException and the application would fail.

Not sure if the width is defined as 4 bytes by default in version 2.3.0-SNAPSHOT; maybe Jonathan can chime in.

But the good news is: without posting actual metrics on the performance increase, we saw Grafana graphs of multi-day time frames return in seconds or less, where those same queries would often not return at all and just time out when pointing Grafana directly at the OpenTSDB reader instances.

Thanks for this original post, Jonathan.

Jonathan Creasy

Jul 20, 2016, 10:36:54 PM
to Casey Doerschuk, OpenTSDB, Peter Childs
We are using 4-byte widths in our system, which is modified from the default build, even in OpenTSDB 2.3.

I think you need 2.3.0 with Splicer only if you wish to use the gexp feature.