OpenTSDB high load strange behavior

Guillaume Loetscher

Jul 9, 2016, 8:15:16 AM
to OpenTSDB
Hi,

First of all, the title may be a bit misleading: I'm not suggesting that OpenTSDB itself is acting weird under high load; the issue is probably in my system / infrastructure. Still, I don't understand what's going on, so here I am.

OpenTSDB version: 2.2
OS: Amazon Linux 2016.03
Infrastructure: AWS EMR
  • master: 1 x m3.xlarge (4 CPU, 15 GB RAM), running the NameNode, ZooKeeper, the HBase Master and OpenTSDB
  • slaves: 4 x m3.xlarge, each running a DataNode and an HBase RegionServer

A note on the infrastructure: I know this setup is not appropriate for a proper Hadoop / HDFS cluster (a single master, no redundant ZooKeeper, only 4 datanodes, and so on). The purpose of this build is just to test out OpenTSDB, see how it behaves, work out the backup strategy, etc. before moving to a "real" production infrastructure.

Context: I'm currently testing OpenTSDB, more specifically load-testing it. I've set up a few Locust instances (http://locust.io), written a small Python script that generates metrics replicating what will be pushed in production, and pointed everything at OpenTSDB.

Using 4 Locust processes, I'm able to generate 1,000 HTTP POSTs per second, each containing 200 metrics. I can see HBase indeed doing 200k inserts per second, load-balanced across several nodes (I had pre-split the table and balanced all regions across the available HBase RegionServers). CPU on the master is quite high but manageable (a load average between 2 and 3, generated mainly by the OpenTSDB process). A simplified sketch of the load script is shown below.
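
To give an idea, the load script is roughly the following (a simplified sketch using Locust's HttpLocust API; the metric names, tag values, batch size and TSD address are placeholders, the real script generates payloads matching our production schema):

  # Simplified sketch of the Locust load script (placeholder metric/tag names;
  # the real script mimics the production metric schema).
  import json
  import random
  import time

  from locust import HttpLocust, TaskSet, task


  class PushMetrics(TaskSet):
      @task
      def put_batch(self):
          now = int(time.time())
          batch = [
              {
                  "metric": "three.cardinality.%d" % i,
                  "timestamp": now,
                  "value": random.randint(0, 100),
                  "tags": {"host": "hostname%d" % (i % 10), "delivery": "0"},
              }
              for i in range(200)  # ~200 data points per POST
          ]
          # One JSON array of data points per HTTP put request.
          self.client.post("/api/put", data=json.dumps(batch),
                           headers={"Content-Type": "application/json"})


  class TsdbUser(HttpLocust):
      task_set = PushMetrics
      host = "http://tsd-host:4242"  # placeholder TSD address
      min_wait = 0
      max_wait = 0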

The issue appears when I add more load. When I start more Locust processes, the CPU load on the OpenTSDB host increases (around 5, spiking to 6). The Locust producers keep up the same throughput for a short time (less than a minute), then their throughput drops almost to a full stop (fewer than 10 POSTs / sec / process, when each process is capable of sending 250 POSTs / sec). At the same time, the write rate on HBase is also very low. If I stop the producers, the CPU on the OpenTSDB host stays high for several minutes (HBase write throughput still being very low), then suddenly the load decreases. At that point, I can restart the Locust producers and reach the previous high throughput again.

Also, I can see that during almost the whole "high load" period, OpenTSDB was not able to write data points to HBase: all Grafana graphs show a big "hole" over that period.

I don't understand what happened during the very-high-load period. Here are my observations and questions:
  • OpenTSDB is CPU-bound (it's clearly mentioned in the documentation). So, by generating too many POSTs, I've overloaded the system, leading to the high load (a load average above 4 means there are processes waiting for a CPU on the 4-core master).
  • Since I'm generating more requests than OpenTSDB can handle, I suppose it starts to buffer. However, I don't know where it stores the backlog. It seems to use its cache directory for this purpose (as suggested by the FAQ question "What type of hardware should I run the TSDs on?"), which should reside on tmpfs (so in RAM). I have several questions here:
    • Is "tsd.http.cachedir" used to store transient data points? If yes, is it the main "buffer" storage when OpenTSDB is under high load?
    • Can I point it at a directory located on a disk partition? (I suppose it's possible, but not very clever, as the disks would quickly become I/O-limited.)
    • How does OpenTSDB use its process memory? Does it store transient data points there as well? How much RAM can the OpenTSDB process take?
  • If OpenTSDB caches data points when it's overloaded, why:
    • did the load stay high after I stopped the Locust producers? That could mean it was working to empty the buffer, but...
    • ... then why did the write throughput to HBase stay very low, and why did I lose all the points from the "high load" period?
  • Understanding how OpenTSDB reacts under high load would be very helpful. What I've understood so far:
    • OpenTSDB comes under heavy load and starts caching data points "somewhere". It is still able to write the points currently being processed.
    • The cache fills up quickly and eventually becomes full. OpenTSDB then rejects new inbound data points (hence the very poor throughput seen from the producers).
    • However, the OpenTSDB process is still able to process points, so to me it should be able to take some points out of the cache, process them and write them to HBase. For example, if we push twice as many data points as OpenTSDB can handle, it should still be able to process half of the points and reject the other half. However, what I see is that it drops all of the points. Why? (A sketch of how I plan to check this is just below.)
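
To see whether points are actively rejected rather than silently dropped, I plan to POST with the "details" query-string parameter on /api/put, which should return success / failed counters plus error messages for the request. A rough sketch (the TSD address is a placeholder):

  # Probe /api/put with ?details to see if the TSD reports rejected data points.
  # Sketch only; http://tsd-host:4242 is a placeholder.
  import json
  import time

  import requests

  point = [{
      "metric": "three.cardinality.0",
      "timestamp": int(time.time()),
      "value": 1,
      "tags": {"host": "hostname0"},
  }]

  resp = requests.post("http://tsd-host:4242/api/put?details",
                       data=json.dumps(point),
                       headers={"Content-Type": "application/json"})
  body = resp.json()
  # With ?details the response carries per-request accounting instead of an empty 204.
  print("HTTP", resp.status_code,
        "success:", body.get("success"),
        "failed:", body.get("failed"))
  for err in body.get("errors", []):
      print("error:", err)
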
Any advice would be greatly appreciated.

Many thanks,

Guillaume

Jonathan Creasy

Jul 9, 2016, 12:17:36 PM
to Guillaume Loetscher, OpenTSDB

My initial advice would be to use the telnet protocol to insert your data, if at all possible.

I am not at my computer, but I will try to get back to you with some more useful advice later tonight.

What is your OpenTSDB config?

Sterfield

Jul 9, 2016, 1:44:54 PM
to Jonathan Creasy, OpenTSDB
Are you suggesting pushing data over telnet to debug this issue, or in production? The documentation discourages using telnet to push data points.

As for the configuration, it's pretty standard: it's the one from the RPM package, with three modifications:
  • tsd.http.request.enable_chunked = true
  • tsd.http.request.max_chunk = 40000
  • tsd.core.auto_create_metrics = true

Jonathan Creasy

Jul 9, 2016, 8:44:30 PM
to Sterfield, OpenTSDB
It does? Huh, OK, that's how I do it most of the time; I have built more than one 10-million-dps cluster that way. To be honest, I haven't read all the docs lately; maybe I should.

I was also interested in the metadata-related settings. Would it be possible to just hit /api/config and send us the whole blob? With the hostnames removed, of course.

Jonathan Creasy

Jul 9, 2016, 8:48:52 PM
to Sterfield, OpenTSDB
I see, yeah, it does, because with telnet you are essentially streaming the writes and you don't get back a "500" or anything if a write fails. The trade-off there is performance, though. I guess I haven't tested it, but I suspect that the telnet protocol is significantly more performant. I've built more than one 10-million-DPS cluster, and I do 99% of the writes via the telnet protocol (a minimal example of what that looks like is below).
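
For reference, a telnet-style write is just one "put" line per data point sent over a plain TCP socket, along these lines (a sketch; the host, port and metric names are placeholders):

  # Minimal telnet-protocol writer (sketch; "tsd-host" and the metrics are placeholders).
  import socket
  import time

  sock = socket.create_connection(("tsd-host", 4242))
  now = int(time.time())
  # Format: put <metric> <timestamp> <value> <tagk=tagv> [<tagk=tagv> ...]
  lines = [
      "put three.cardinality.0 %d 24 host=hostname0 delivery=0\n" % now,
      "put three.cardinality.1 %d 12 host=hostname1 delivery=1\n" % now,
  ]
  sock.sendall("".join(lines).encode("ascii"))
  sock.close()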

I have metrics that I watch, like the RPC counts and HBase writes, so I can tell if they start failing. I also look at the bad-value metrics and things like that. A rough way to poll those counters is sketched below.
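
If it helps, something along these lines is enough to keep an eye on the TSD's own counters while you load it (a sketch; the address is a placeholder and the exact metric names vary a bit between versions):

  # Poll the TSD's internal counters during a load test.
  # Sketch; http://tsd-host:4242 is a placeholder, metric names differ by version.
  import time

  import requests

  WATCH_PREFIXES = ("tsd.rpc", "tsd.hbase", "tsd.connectionmgr")

  while True:
      stats = requests.get("http://tsd-host:4242/api/stats").json()
      for s in stats:
          if s["metric"].startswith(WATCH_PREFIXES):
              print(s["metric"], s.get("tags", {}), s["value"])
      print("-" * 40)
      time.sleep(10)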


Sterfield

Jul 11, 2016, 4:26:42 AM
to Jonathan Creasy, OpenTSDB
Hello,

So yeah, the documentation does specify that we should use HTTP to push data (http://opentsdb.net/docs/build/html/user_guide/writing.html#telnet, see the "Note" sub-section).

As I said, nothing fancy :)

However, I saw that tsd.http.cachedir points to "/tmp/opentsdb", and there are two issues with that:
  • the folder lives on disk, more specifically on the root partition
  • that partition is only 10 GB and is already 50% full
So I will redo my tests, but most likely:
  • the Locust producers overload the system and OpenTSDB tries to use the disk cache
  • the disk cache is quite slow, which generates a lot of I/O load on the system
  • the disk cache fills up the entire "/" partition, leading to various issues such as general system sluggishness
In that case, it's not totally crazy that OpenTSDB cannot write points to HBase, considering the high CPU and disk load put on the system, not to mention the disk-space shortage.

I'll keep you posted.

Thanks,

Sterfield

Jul 11, 2016, 11:40:04 AM
to Jonathan Creasy, OpenTSDB
So, some additional news:
  • On the master, /tmp is a symbolic link pointing to a partition separate from /, and there is plenty of space left on it.
  • However, the cache folder stayed empty, even under heavy load. I now think this cache is only used for graphs and previous GET calls; that part of the documentation seems to indicate so.
  • The resident memory usage of the main OpenTSDB process increases under heavy load, up to a 6 GB limit. That limit is set in the RPM init script:
  # Set a default value for JVMARGS
  : ${JVMXMX:=-Xmx6000m}
  : ${JVMARGS:=-DLOG_FILE_PREFIX=${LOG_FILE} -enableassertions -enablesystemassertions $JVMXMX -XX:OnOutOfMemoryError=/usr/share/opentsdb/tools/opentsdb_restart.py}
  export JVMARGS


So, it seems that:
  • When OpenTSDB fills its heap memory, it acts weirdly. A safeguard is in place to restart OpenTSDB in that case (the -XX:OnOutOfMemoryError hook above), but it doesn't seem to work on EMR.
  • Assuming the "solution" is to restart OpenTSDB when it fills its memory, I think it's safe to say the configuration should be tuned to avoid that situation at all costs.
  • In OpenTSDB 2.2, you can change "hbase.nsre.high_watermark" (default 10000) to adjust the number of RPCs the HBase client is willing to queue up (a config sketch is below).
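
If I understood correctly, since 2.2 the AsyncHBase client settings can be set directly in opentsdb.conf, so raising it would look something like this (untested on my side; the value is just an illustration):

  # opentsdb.conf — AsyncHBase NSRE queue limit (untested; value only for illustration)
  hbase.nsre.high_watermark = 20000
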
To be continued.

Jonathan Creasy

Jul 11, 2016, 11:52:40 AM
to Guillaume Loetscher, OpenTSDB

I would think huge queries relate to the size of queries fetching data, not writing data. You are using separate instances for reading, correct?

You may benefit from the "wo" mode; it loads fewer of the internal RPC plugins.

The cache is mostly where it writes out data for gnuplot and where the gnuplot graphs are written, iirc.

What do your RPC-related metrics look like?

Are your write nodes colocated with your RegionServers?

What does your query load look like during this time?

I assume your query nodes are "ro" mode?

-Jonathan

Sterfield

Jul 12, 2016, 12:13:30 PM
to Jonathan Creasy, OpenTSDB
  • Separate instances for reading: no, I have only one OpenTSDB node, but I'm doing zero reads at the moment.
  • "wo" mode: can you please explain?
  • RPC metrics: the metrics I push are pretty standard. Here's an extract:
{"timestamp": 1468339458, "metric": "AAAAAAAAAAAAAAAAAAAAA0_MyID.instance0-mkt-prod1.three.cardinality.0", "value": 24, "tags": {"delivery": "0", "ip": "7", "host": "hostname0", "mx": "3"}}

This is a metric with three tags; I also have metrics with two tags and with one tag (in addition to the "host" tag in each case).

Obviously the real name will not be "three.cardinality.0"; that is just for the simulation.
  • Write nodes colocated with the RegionServers: no, not at the moment. The OpenTSDB write node is on the master node. However, the network is not the bottleneck here, so I think this is not (yet) relevant.
  • Query load: you mean the CPU load?
  • Query nodes in "ro" mode: no, they are not, and I didn't see such an option.
Thanks,

Guillaume

Jonathan Creasy

Jul 12, 2016, 1:23:33 PM
to Sterfield, OpenTSDB
It seems I need to add a paragraph about this to the documentation; I've been explaining it a lot lately. :)

OK, so: "ro", "rw" and "wo" modes. There is a setting, "tsd.mode":

tsd.mode (2.1) | Type: String | Required: Optional | Default: rw
Whether or not the TSD will allow writing data points. Must be either "rw" to allow writing data or "ro" to block data point writes. Note that metadata such as UIDs can still be written/modified.

This makes a much bigger difference than it should. At a basic level, this flag lets the TSD skip loading a few of the built-in RPC handlers that are not needed in certain modes.

This is in my fork, but I'm pointing you at it because I'm working on a refactor and I think it makes it clearer what is going on.


By query load, I meant the volume of queries, both in terms of "size of the returned dataset" and "number of query requests per second". Since you are doing no reads, that rules this out.

When your query nodes (the TSD instances you use to query your data) are in rw mode, this can have an impact on write throughput as well. Because of this, it is best to have dedicated TSD instances for reads and writes. I typically run the query instances in "ro" mode (tsd.mode) and the write instances in "rw" mode, simply because "wo" mode removes the version and config API. I'll probably change that in my upcoming RPC PR. Configuration-wise, the split looks roughly like the sketch below.
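
In terms of configuration, it is just two sets of TSDs with different opentsdb.conf files, something like this (a sketch; the port numbers are only examples):

  # write TSDs (the ones the collectors hit) — sketch, ports are examples
  tsd.mode = rw
  tsd.network.port = 4242

  # query TSDs (the ones Grafana hits)
  tsd.mode = ro
  tsd.network.port = 4243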

I actually wasn't aware of "wo" mode until I noticed it in the code a few weeks ago; it literally just loads the PUT RPC plugin, so it might be worth a try.

-Jonathan


Sterfield

Jul 13, 2016, 6:32:44 AM
to Jonathan Creasy, OpenTSDB
Nice, I'll give it a try.

Right now I'm running more load tests, but with fewer metrics per POST. The memory usage of OpenTSDB seems more manageable (it stays around 3 GB of resident memory) for 130k-140k metrics/sec. However, increasing the number of producers does not increase the number of metrics written per second, so I'm now looking for the bottleneck. But that's another story :)

Jonathan Creasy

Jul 13, 2016, 1:21:23 PM
to Guillaume Loetscher, OpenTSDB

Thanks for your involvement; I always look forward to improving this tool.
