Is OpenTSDB the correct choice?

Matias Bonaventura

May 26, 2015, 12:37:53 PM

to open...@googlegroups.com

Hi all,

I am evaluating OpenTSDB and other time series databases (InfluxDB, Graphite, etc.) and I wanted to know, given your experience, whether these are the correct choice for my purpose.

I do network simulations. When the simulation runs, it produces several metrics to track the evolution of the models (queue lengths, bandwidth, latencies, application variables, etc.). These metrics are logged and then analyzed. The metrics are time series (basically variableName + variableValue + timeStamp), which is why we thought of using a TSDB.
Currently the simulator supports writing the metrics to a file (in SOD Scilab format, HDF5), which we then consume from R to analyze and generate plots. We are simulating quite a big network (~3000 nodes), so the simulation can produce a lot of information. For example, if we simulate 60s of operation (which takes ~8 hours of execution) we end up with ~16M entries (about 550 entries per second of execution).
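As a quick sanity check of those figures, here is a small sketch of the arithmetic (the variable names are only illustrative). It also shows the rate per simulated second, which is the number that matters if timestamps follow simulation time rather than wall-clock time:

```python
# Back-of-the-envelope check of the logging rate quoted above.
entries = 16_000_000        # ~16M logged entries
wall_clock_hours = 8        # ~8 hours of execution
simulated_seconds = 60      # 60 s of simulated operation

# Rate the database has to absorb while the simulation is running.
per_wall_second = entries / (wall_clock_hours * 3600)

# Density of points along the simulated time axis.
per_simulated_second = entries / simulated_seconds

print(round(per_wall_second))       # entries per second of execution
print(round(per_simulated_second))  # entries per simulated second
```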

We started with a much smaller model and now the amount of metrics is becoming hard to analyze. We would benefit from all the time series operations and queries. Also, the simulation runs on a single machine, but we would benefit from a distributed setup, as we launch multiple independent simulations on different nodes, each node with different parameters. So merging the results in a single DB would ease the analysis when we scan several parameters.
So we are evaluating the convenience of logging directly to a TSDB, but after reading a bit of the documentation I have some questions:

Most importantly: would OpenTSDB be able to handle this number of operations per second (probably more)?
I read in the FAQ that a TSD can handle ~2000 new entries per second. Would using batch updates (for example, keeping things in memory and sending data to the TSD every certain number of entries) improve the performance?

- I saw that OpenTSDB allows downsampling, grouping and interpolation, but I haven't found whether operations that involve more than one metric (SQL-like "joins") can be done on the fly.

- We need to support millisecond and sometimes microsecond precision. I read that OpenTSDB supports second precision, and millisecond precision with some limitations.
As we need to log a very short period of time (max 5 min) but with a lot of precision, do you think there will be a lot of workarounds if we represent our microseconds as seconds in the TSDB (second 5 would be recorded in the TSDB with timestamp 5000000)?

- I realize that although OpenTSDB is a generic TSDB, it is oriented to the scenario of logging real networking metrics (like metrics from a farm). Do you think it can still handle this "simulation" scenario?

Thank you very much for any comment or ideas you can provide.

Regards,
Matias

Loic

May 30, 2015, 3:30:41 AM

to open...@googlegroups.com

Hi Matias,

I follow OpenTSDB, but I know KairosDB better. I will try to answer to the best of my knowledge of TSDBs.

- I read in the FAQ that a TSD can handle ~2000 new entries per second. Would using batch updates (for example, keeping things in memory and sending data to the TSD every certain number of entries) improve the performance?

Both KairosDB and OpenTSDB would support this throughput. We have KairosDB nodes with an acquisition rate of 50,000 samples per second.

- I saw that OpenTSDB allows downsampling, grouping and interpolation, but I haven't found whether operations that involve more than one metric (SQL-like "joins") can be done on the fly.

Neither OpenTSDB nor KairosDB allows SQL-join-like operations. You would have to manage the join on the client side by running multiple queries. I think InfluxDB would allow it.
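A minimal sketch of what such a client-side join could look like, assuming each query result has already been reduced to a plain {timestamp: value} mapping (the function name and sample data below are hypothetical, not part of any TSDB client):

```python
def join_on_timestamp(series_a, series_b):
    """Inner-join two {timestamp: value} mappings on their shared timestamps.

    Returns a list of (timestamp, value_a, value_b) tuples sorted by time.
    Timestamps present in only one series are dropped.
    """
    shared = sorted(series_a.keys() & series_b.keys())
    return [(ts, series_a[ts], series_b[ts]) for ts in shared]


# Hypothetical results of two separate queries, one per metric.
queue_length = {1000: 12, 2000: 15, 3000: 9}
latency_ms = {2000: 0.8, 3000: 0.5, 4000: 0.7}

for ts, qlen, lat in join_on_timestamp(queue_length, latency_ms):
    print(ts, qlen, lat)
```

This only matches exact timestamps. If the two metrics are sampled at different instants, you would need to downsample or interpolate them onto a common grid before joining.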

- We need to support millisecond and sometimes microsecond precision. I read that OpenTSDB supports second precision, and millisecond precision with some limitations.
As we need to log a very short period of time (max 5 min) but with a lot of precision, do you think there will be a lot of workarounds if we represent our microseconds as seconds in the TSDB (second 5 would be recorded in the TSDB with timestamp 5000000)?

Neither OpenTSDB nor KairosDB supports microsecond precision; both would require a code change. KairosDB natively supports millisecond precision without limitation. InfluxDB natively supports microsecond precision.

- I realize that although OpenTSDB is a generic TSDB, it is oriented to the scenario of logging real networking metrics (like metrics from a farm). Do you think it can still handle this "simulation" scenario?

It would definitely work, with the indicated limitations (join, time precision).

ManOLamancha

May 31, 2015, 7:17:12 PM

to open...@googlegroups.com

On Tuesday, May 26, 2015 at 9:37:53 AM UTC-7, Matias Bonaventura wrote:

Hiya, Loic pretty much got it, but I'll fill in a bit more.

Most importantly: would OpenTSDB be able to handle this number of operations per second (probably more)?
I read in the FAQ that a TSD can handle ~2000 new entries per second. Would using batch updates (for example, keeping things in memory and sending data to the TSD every certain number of entries) improve the performance?

The throughput will really depend on your HBase cluster. Is that 550 values per second on a per node basis or across the entire cluster? If that's for your entire simulation you'll be perfectly fine. If it's per node and you're looking at 1.6M values per second then you'll need beefy HBase machines and a number of nodes. We're running about 70 HBase nodes and pushing a steady 3M points per second with peaks at 7M. Each TSD handles about 100K wps using the telnet style interface. If you can batch then that definitely helps a bit but you'd likely be fine writing straight to the socket.
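For reference, a rough sketch of writing straight to the socket with the telnet-style `put` command (`put <metric> <timestamp> <value> <tagk=tagv> ...`). The metric and tag names are made up, and the host and port defaults are assumptions (4242 is the usual TSD port); adjust them for your setup:

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one telnet-style line: put <metric> <timestamp> <value> <tagk=tagv> ..."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}\n"


def format_batch(points):
    """Concatenate many points into one buffer so they go out in a single write."""
    return "".join(format_put(*point) for point in points)


def send_batch(points, host="localhost", port=4242):
    """Open a TCP connection to the TSD and write the whole batch at once."""
    payload = format_batch(points).encode("ascii")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)


# Hypothetical simulation metrics: (metric, timestamp, value, tags).
points = [
    ("sim.queue.length", 1433000000, 12, {"node": "n0001"}),
    ("sim.latency", 1433000000, 0.8, {"node": "n0001"}),
]
print(format_batch(points), end="")
```

Buffering a few hundred lines before each `sendall` is the kind of batching meant here; it reduces the number of small writes without changing the wire format.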

- I saw that OpenTSDB allows downsampling, grouping and interpolation, but I haven't found whether operations that involve more than one metric (SQL-like "joins") can be done on the fly.

Not yet. I have some beta code about to hit production and once it's ready I'll upstream it. There's also some good work from Turn around expressions that we may pull in too.

- We need to support millisecond and sometimes microsecond precision. I read that OpenTSDB supports second precision, and millisecond precision with some limitations.
As we need to log a very short period of time (max 5 min) but with a lot of precision, do you think there will be a lot of workarounds if we represent our microseconds as seconds in the TSDB (second 5 would be recorded in the TSDB with timestamp 5000000)?

Right, unfortunately no microsecond support but you can fully utilize the millisecond support, just with batched puts. One option if you really need the precision but don't need a value at every microsecond is to write the precise value as an annotation along with a millisecond value. If you want native support, it wouldn't be too difficult to hack, just a matter of adding more bytes to the column qualifiers and tweaking the read/write paths all the way through.

- I realize that although OpenTSDB is a generic TSDB, it is oriented to the scenario of logging real networking metrics (like metrics from a farm). Do you think it can still handle this "simulation" scenario?

Sure, there are folks who use it for such simulations, but the microsecond precision may be a kicker for you. Thanks!

Matias Bonaventura

Jun 3, 2015, 4:43:46 AM

to ManOLamancha, open...@googlegroups.com

Loic, ManOLamancha, thank you very much for your answers.

I started running some first tests and I think that OpenTSDB will definitely perform better than our current backend.
Regarding the throughput, it really depends on how we configure the simulation. With low logging (the best our current backend can handle) it writes ~500 points per second. With medium/high logging it can rise to between 100K and 500K points per second. To start with, we will run a single TSD on the same node as the simulation, so we will have to find a balance on how much to log.

The simulator is implemented in C++, so I'm coding the openTSDB clients in C++ as well (I found this simpleTSDDBClient as a starting point).

In your experience, which input method performs better: telnet or the HTTP API (which allows sending several points per request and compression)?
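To make the HTTP option concrete, this is roughly the request body I have in mind: a JSON array of data points, optionally gzip-compressed, to be POSTed to the TSD's `/api/put` endpoint. It is only a sketch; the metric and tag names are made up, and the assumption that the compressed body is sent with a `Content-Encoding: gzip` header should be checked against the HTTP API documentation:

```python
import gzip
import json


def build_put_body(points, compress=True):
    """Serialize (metric, timestamp, value, tags) tuples as a JSON array.

    With compress=True the JSON bytes are gzip-compressed, in which case the
    request would need a matching Content-Encoding header (assumption).
    """
    body = json.dumps([
        {"metric": metric, "timestamp": ts, "value": value, "tags": tags}
        for metric, ts, value, tags in points
    ]).encode("utf-8")
    return gzip.compress(body) if compress else body


# Hypothetical batch of simulation metrics.
points = [
    ("sim.queue.length", 1433000000, 12, {"node": "n0001"}),
    ("sim.queue.length", 1433000001, 15, {"node": "n0001"}),
]
raw = build_put_body(points, compress=False)
packed = build_put_body(points)
print(len(raw), len(packed))  # compare sizes for this batch
```

For a batch this small, gzip overhead can outweigh the savings; compression should pay off more on batches of hundreds of points with repeated metric and tag names.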

Regarding the microsecond precision, to avoid recompiling we will try logging microseconds as seconds using 10-digit timestamps. That is, second 1 in the simulation will be logged with timestamp 0001000000. That would give us a maximum simulation time of 4294 seconds, which is way more than we need. Of course we won't be able to use the "1h" notation and existing plotting tools will show a wrong scale, but in general we will be plotting with custom scripts in R.
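The mapping can be sketched as follows, assuming the 4294-second ceiling comes from second-resolution timestamps having to fit in an unsigned 32-bit integer (2^32 - 1 = 4294967295, which has 10 digits). The function names are made up for illustration:

```python
MICROS_PER_SECOND = 1_000_000
MAX_SECONDS_TIMESTAMP = 2**32 - 1  # assumed upper bound for a "seconds" timestamp


def sim_time_to_timestamp(sim_seconds):
    """Encode simulated time (in seconds) as a count of microseconds.

    The result is what gets written in the timestamp field, so the database
    will treat each microsecond as if it were one second.
    """
    micros = round(sim_seconds * MICROS_PER_SECOND)
    if not 0 <= micros <= MAX_SECONDS_TIMESTAMP:
        raise ValueError(f"simulated time {sim_seconds}s does not fit in 32 bits")
    return micros


def timestamp_to_sim_time(timestamp):
    """Decode a stored timestamp back into simulated seconds."""
    return timestamp / MICROS_PER_SECOND


print(f"{sim_time_to_timestamp(1):010d}")          # zero-padded to 10 digits
print(MAX_SECONDS_TIMESTAMP // MICROS_PER_SECOND)  # whole simulated seconds available
```

Any query window or downsampling interval has to be scaled the same way: an interval given as one hour of database time would cover 3600 microseconds of simulated time.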

The strongest drawback would be the limited query language, so it would be nice to see your implementation in production.
Anyway, I think I will have to do a proof of concept to check InfluxDB's performance, because of its query language and native microsecond support. The problem with InfluxDB is that it is still in beta.