Delay between writing and reading the same datapoints


Mehran Shakeri

Sep 6, 2016, 12:10:06 PM
to OpenTSDB
Hi,

I'm new to OpenTSDB and am currently in the evaluation phase. We want to move to a time-series-specific database, and I'm investigating different solutions.

One important task is migrating from our current database to OpenTSDB, and since our current database is a custom one, I have to write the migration myself.

I'm using OpenTSDB 2.2 and the latest HBase version (standalone for the moment) with LZ4 compression.

For the migration I'm working with one of our customers' databases, which consists of almost 500 time series, each with almost 2M data points.

I had to raise some limits in the configuration and also enable appends to ingest my data as fast as possible. Still, it is quite slow!

I tried two approaches:
1. Import from the CLI: I always hit this exception, even when ingesting only one time series at a time:

org.hbase.async.RemoteException: Call queue is full on testUser,42481,1473174792442, too many items queued ?

which makes it impossible to use.

2. Using the HTTP API: I send 50,000 dps per request with a 500 ms sleep between requests. This works in theory: no exceptions. But when I try to read the data back, millions of dps are missing! And I don't see any exceptions in the logs or any HTTP error code; all requests return 200 OK with no error!

So after writing, I tried to go through all dps and check for missing ones. When I start reading immediately after ingesting, I receive almost none of the ingested dps! With a 2 s pause between write and read, 50,000 out of 2M were missing (which is exactly one HTTP request's worth). Now my question is: how does storage work? Should I wait a few seconds before reading my ingested data?
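One way to make silently dropped points visible at write time is `/api/put`'s `details` parameter, which returns a per-request summary of successes and failures instead of a bare status code. Below is a minimal Python sketch of the batched import described above; the host, port, metric name, and tags are placeholder assumptions, not values from this thread:

```python
import json
import urllib.request

def chunk(datapoints, size):
    """Split a list of datapoints into batches of at most `size`."""
    return [datapoints[i:i + size] for i in range(0, len(datapoints), size)]

def put_batch(batch, host="localhost", port=4242):
    """POST one batch to /api/put?details and return the parsed summary.
    With ?details, OpenTSDB reports 'success'/'failed' counts and an
    'errors' list, so dropped points no longer vanish silently."""
    req = urllib.request.Request(
        "http://%s:%d/api/put?details" % (host, port),
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# One datapoint in /api/put form (metric and tags are placeholders):
dp = {"metric": "metric", "timestamp": 1434558735000,
      "value": 42.0, "tags": {"host": "node1"}}
batches = chunk([dp] * 120000, 50000)   # 120k points -> 3 batches
# summary = put_batch(batches[0])       # requires a running TSD
# if summary["failed"]: inspect summary["errors"]
```

Checking `summary["failed"]` after every batch would have flagged the lost request here instead of discovering it millions of points later.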

PS: I POST to /api/query with:

{
    "msResolution": true,
    "start": 1434558735000,
    "queries": [
        {
            "aggregator": "sum",
            "metric": "metric",
            "tags": {}
        }
    ]
}

PS2: HBase and OpenTSDB are running on the same machine:

Intel Xeon CPU E3-1226 v3 @ 3.30 GHz × 4
SSD
16GB ram
Ubuntu 16.04


Thanks in advance for any guide.

Cheers,
Mehran

Jonathan Creasy

Sep 6, 2016, 3:37:25 PM
to Mehran Shakeri, OpenTSDB
Well, with all of the writes being async, you should wait a bit before reading. For most systems I have deployed, I tell the customers (usually a development team) that the SLA from "data emitted from the application" to "data showing up on a graph" is one minute. That accounts for things like local TCollector buffering, network time, and so on; it's usually much faster. So waiting 10-15 seconds before checking with the query should be more than sufficient.

500 series with 2M datapoints each is, in the grand scheme of things, not that large; you should not have much trouble inserting it. One thing you can do is run a dedicated OpenTSDB instance for the writes, with "tsd.mode = wo".

I would use the telnet interface to write the data; that's what I do for large imports or back-filling.
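For reference, the telnet-style import sends one `put` line per datapoint: `put <metric> <unix_timestamp> <value> <tagk=tagv> [...]`. A minimal sketch, assuming a TSD listening on the default port 4242 (the metric, tag, and host values are placeholders):

```python
import socket

def format_put(metric, ts, value, tags):
    """Build one line for the telnet-style 'put' command:
    put <metric> <unix_timestamp> <value> <tagk=tagv> [...]"""
    tag_str = " ".join("%s=%s" % kv for kv in sorted(tags.items()))
    return "put %s %d %s %s" % (metric, ts, value, tag_str)

def bulk_import(lines, host="localhost", port=4242):
    """Stream newline-terminated 'put' lines over the TSD's telnet port."""
    with socket.create_connection((host, port)) as s:
        s.sendall(("\n".join(lines) + "\n").encode("ascii"))

line = format_put("metric", 1434558735, 42.0, {"host": "node1"})
# bulk_import([line])   # requires a running TSD
```

Streaming many lines over one connection avoids the per-request HTTP overhead, which is why this path is often preferred for back-fills.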

What kind of system are you getting the data out of? Is there anything special about your process that is doing the writes?

You may want to review this issue, #783, to see if it is related to the error you get with the telnet writes. From that issue, it looks like you should turn off "tsd.core.meta.enable_realtime_ts" and potentially pre-split the meta table, although with one server for everything, that won't really do anything.

Mehran Shakeri

Sep 7, 2016, 5:52:40 AM
to OpenTSDB
Thank you, Jonathan, for the reply.

We have some nodes running on embedded industrial computers, and each must have its own full database, so there will be one HBase + TSDB per node. We will then replicate HBase somewhere else. The fact that everything must run on one node is our constraint.

This 10-15 s delay still doesn't work for the migration. I even tried something else: after writing, I waited a few minutes and then stopped TSDB and HBase, just to be sure nothing was left in a queue. After restarting, 55k dps were still missing.

Another question that comes with this delay: is it only necessary because of the high write volume, or will we see it in production with a lower dps rate as well? Part of our software displays the real-time state of devices and must be live. In the working environment we barely write 4,000 dps every 10-15 seconds, so it's almost nothing, but our live charts must be real-time.

I also tried "tsd.mode = wo" and telnet; still the same exception!

Regarding issue #783, I couldn't find the right config parameter to increase the queue size. I also found that there is still an open issue and an open PR for this on GitHub.

My current workaround is to wait a few seconds, then start recovering data: I go through the points and rewrite the missing ones. After 2-3 passes I usually have the complete data stored!
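That verify-and-rewrite pass can be sketched as follows; `missing_points` is a hypothetical helper, and the `{timestamp: value}` map shape is assumed to match what `/api/query` returns in its `dps` field with `msResolution`:

```python
def missing_points(written, read_back):
    """Return the datapoints whose timestamps did not come back from a
    query, so they can be re-sent. `written` is the {ts: value} map that
    was ingested; `read_back` is the dps map returned by /api/query."""
    return {ts: v for ts, v in written.items() if ts not in read_back}

written = {1434558735000: 1.0, 1434558736000: 2.0, 1434558737000: 3.0}
read_back = {1434558735000: 1.0, 1434558737000: 3.0}
missing = missing_points(written, read_back)
# -> {1434558736000: 2.0}; re-POST these to /api/put and re-check
```

Looping until `missing` is empty (with a short sleep between passes) is exactly the 2-3-pass recovery described above.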

I'm still open to a better solution. Thanks in advance for any suggestion.

Cheers,
Mehran



Jonathan Creasy

Sep 20, 2016, 12:30:02 AM
to Mehran Shakeri, OpenTSDB
Are you trying to query the data from the embedded devices?

I'm still not sure there isn't a better way to do what you're trying to do; it sounds odd, to say the least. There is probably something more elegant we can work out here.

-Jonathan

Jonathan Creasy

Sep 20, 2016, 12:32:54 AM
to Mehran Shakeri, OpenTSDB
The configuration option is right here:


You can set it in your configuration by adding a line:

tsd.core.meta.enable_realtime_ts = false

Mehran Shakeri

Sep 20, 2016, 4:16:15 AM
to OpenTSDB
Currently I'm working on a station with a Xeon E3-1226 CPU @ 3.3 GHz and 16 GB RAM, but production will be embedded.

What does "tsd.core.meta.enable_realtime_ts = false" do? As I understand from the documentation, it helps when new metric/tagk/tagv (meta data) appear during ingestion. Am I right? If so, I don't think it helps, since I'm importing all the meta data first and then the data points (which was really effective).