Your conclusions sound correct.
If I may ask, what is the context that has you duplicating Zabbix data into OpenTSDB? This seems like a strange thing to be doing.
We have an extensive monitoring structure in place with Zabbix, crossing many data centers and tens of thousands of machines. The monitoring structure is 'designed in' to several operations groups, so swapping out Zabbix is not an option. But Zabbix cannot keep data longer than a couple of weeks, and it does a very poor job at aggregation (across metadata, like tags), causing lots of downstream processing and, of course, leaving us no long-term data.

We are under a directive to build an archive that can store high-resolution data for at least 1 year, preferably 2 years. We do not want to deploy and manage another set of agents on all the machines (we already have a group and infrastructure that does this). Our shortest and best option right now is to 'intercept' Zabbix data, map Zabbix metadata onto the intercepted data, and push it into a real time-series database. We have been doing this with OpenTSDB for over a year for selected metrics (required by our BI team). Management has now upped the requirements. As I mentioned, we can now accept Zabbix time-series data, map it to Zabbix metadata, and produce an OpenTSDB string at 40K+ samples per second.
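For concreteness, here is a minimal sketch (not our production code) of what producing such an OpenTSDB telnet-protocol line can look like; the function name and the sample metadata are purely illustrative:

```python
import re

def to_opentsdb_line(metric, timestamp, value, tags):
    """Render one sample as an OpenTSDB telnet 'put' command.

    OpenTSDB only allows [a-zA-Z0-9_./-] in metric and tag strings,
    so anything else is squashed to '_'.
    """
    def clean(s):
        return re.sub(r"[^a-zA-Z0-9_./\-]", "_", str(s))
    tag_str = " ".join(
        "%s=%s" % (clean(k), clean(v)) for k, v in sorted(tags.items()))
    return "put %s %d %s %s" % (clean(metric), timestamp, value, tag_str)

# A Zabbix key 'system.cpu.load[percpu,avg1]' mapped by a recipe might become:
print(to_opentsdb_line("system.cpu.load", 1400000000, 0.42,
                       {"host": "web01", "dc": "iad", "param": "percpu.avg1"}))
# put system.cpu.load 1400000000 0.42 dc=iad host=web01 param=percpu.avg1
```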
> 3) Should I consider bypassing the tcollector and building my own high speed push to tsd's.
I have done exactly that. Now I use Flume for data movement, with my own Flume sink putting the data into OpenTSDB. Right now the TSD tends to OOM, due to the fact that it doesn't cap its memory usage.
Chris, what kind of hardware are you running where you can get 60K samples per second on a single TSD? I have not done any testing yet. Tsuna's docs indicate a single-core, modest-speed instance can easily handle 2K. I also understand Tsuna has worked on that part extensively since he first wrote his doc. Today I will be lifting code and setting up the Zabbix collector to pass through a 'SendThread' class; a rough sketch of what I have in mind follows.
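Roughly the shape I'm picturing, loosely modeled on the sender thread in tcollector (the class name, queue size, and batch size here are my own placeholders, not tcollector's actual API):

```python
import queue
import socket
import threading

class SendThread(threading.Thread):
    """Owns one TSD connection; drains a bounded queue of 'put' lines."""

    def __init__(self, host, port, maxsize=100000):
        super().__init__(daemon=True)
        self.q = queue.Queue(maxsize=maxsize)   # backpressure for the producer
        self.addr = (host, port)

    def run(self):
        # Reconnect/retry handling is omitted to keep the sketch short.
        sock = socket.create_connection(self.addr)
        while True:
            lines = [self.q.get()]              # block until data arrives
            # Batch whatever else is queued into one write to cut syscalls.
            while not self.q.empty() and len(lines) < 1000:
                lines.append(self.q.get_nowait())
            sock.sendall(("\n".join(lines) + "\n").encode())
```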
We looked at intercepting Zabbix data at the proxies, but the Zabbix itemid is not attached to the data at that point, and to get all the Zabbix metadata attached we need the itemid. In researching, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork done by the Asian crew, you can see exactly where that intercept is.

The challenge with Zabbix metadata is in the mapping recipes: different Zabbix installations would require different recipe code. I'll look at building a class that can be inherited so developers can build their own recipes for handling the difficult stuff like Zabbix key names, Zabbix key parameters, and Zabbix groups, as sketched below. All require some thought and possibly 'policies' that Zabbix must follow. (Groups were a particularly difficult concept to master until we established a 'naming convention' that the Zabbix folk must follow.)
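Roughly the shape I'm picturing for that base class; the names and the default behaviors are illustrations only, not working code from our system:

```python
class ZabbixRecipe(object):
    """Base class for site-specific Zabbix -> TSDB mapping recipes."""

    def metric_name(self, zabbix_key):
        """Map a key like 'vfs.fs.size[/,pfree]' to a TSDB metric name."""
        return zabbix_key.split("[", 1)[0]

    def key_params_to_tags(self, zabbix_key):
        """Turn bracketed key parameters into tags; sites override this."""
        if "[" not in zabbix_key:
            return {}
        params = zabbix_key[zabbix_key.index("[") + 1:-1].split(",")
        return {"param%d" % i: p for i, p in enumerate(params) if p}

    def groups_to_tags(self, groups):
        """Apply the site's group naming convention, e.g. 'dc-iad' -> dc=iad."""
        tags = {}
        for g in groups:
            if "-" in g:
                k, v = g.split("-", 1)
                tags[k] = v
        return tags
```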
1) Is there a limit to the number of metrics? The simple answer is no. However, the Python dictionary used for de-duping will get large (as large as my Zabbix metadata mapper!). I don't see a serious problem here. In my case, de-duping may not even be necessary, as Zabbix has already done much of this work (by reducing the sample rate on non-changing data). A sketch of the de-dup idea follows after item 2.
2) My collector can process 40K+ samples per second. Can tcollector keep up? I don't think so. Although tcollector has knowledge of multiple TSDs, it seems to establish a single connection and pump all data to that connection. I did not see any tests on the output queue to determine whether tcollector is falling behind the incoming data and should establish additional connections. This may be where a modification is required; one possible shape is sketched after this list.
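For item 1, the de-dup dictionary idea with made-up names; this mirrors the approach tcollector takes (suppress unchanged values, but re-emit periodically so graphs don't show gaps):

```python
class Deduper(object):
    """Drop unchanged samples, but re-emit every max_gap seconds."""

    def __init__(self, max_gap=600):
        self.last = {}            # (metric, tags) -> (timestamp, value)
        self.max_gap = max_gap

    def should_emit(self, metric, tags, ts, value):
        key = (metric, tuple(sorted(tags.items())))
        prev = self.last.get(key)
        if prev is not None and prev[1] == value and ts - prev[0] < self.max_gap:
            return False          # unchanged and recent: suppress
        self.last[key] = (ts, value)
        return True
```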
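For item 2, one possible shape for the modification: hash each series across several sender threads (reusing the hypothetical SendThread sketched earlier in the thread), each with its own TSD connection, so a single socket is no longer the bottleneck:

```python
class FanOut(object):
    """Pins each time series to one sender so per-series ordering holds."""

    def __init__(self, tsd_addrs):
        # One SendThread (from the earlier sketch) per TSD endpoint.
        self.senders = [SendThread(h, p) for h, p in tsd_addrs]
        for s in self.senders:
            s.start()

    def send(self, line, series_key):
        # series_key should be stable per metric+tags, e.g. a sorted tuple.
        self.senders[hash(series_key) % len(self.senders)].q.put(line)
```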
> In researching, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork done by the Asian crew, you can see exactly where that intercept is.
With a current rate of ~4-5K values per second (looking historically, we have bursts up to ~40K; curious if there are any recommended methods for testing how that will be handled without having to go through both systems?). Keeping an eye on the tcollector.* metrics (the tcollector's internal self-reporting; specifically, lines_dropped looks like it could be a signal), everything looks OK (... still testing with the full dataset/load).
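One low-risk way to test the ~40K bursts without going through both systems might be to capture a day of the emitted 'put' lines and replay them against a test TSD at a chosen rate while watching those same tcollector.* / TSD metrics. A rough replayer sketch, where the file path, host, and rate are placeholders:

```python
import socket
import time

def replay(path, host, port, rate=40000):
    """Stream captured 'put' lines to a TSD at roughly `rate` lines/second."""
    sock = socket.create_connection((host, port))
    batch, start = [], time.time()
    with open(path) as f:
        for n, line in enumerate(f, 1):
            batch.append(line)
            if len(batch) == 1000:
                sock.sendall("".join(batch).encode())
                batch = []
                # Sleep just enough to hold the requested rate.
                expected = n / float(rate)
                elapsed = time.time() - start
                if expected > elapsed:
                    time.sleep(expected - elapsed)
    if batch:
        sock.sendall("".join(batch).encode())
```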
> In researching, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork done by the Asian crew, you can see exactly where that intercept is.

The method referenced in the ^ PR uses MySQL replication to connect (vs. a patch on the Zabbix server process); additionally, the in-memory map (itemid:host) seems to be generally applicable to vanilla Zabbix installs (see the sketch below). However, the miraclelinux/HistoryGluon:MIRACLE-ZBX-2.0.3-NoSQL branch certainly looks like it provides not only the entry point for extending the Zabbix server but also a nice path forward to potentially using the full-resolution metrics in TSDB for display on the frontend. This strategy looks like it could be a nice way to shed some responsibility off Zabbix (where it might be entrenched or a perfectly acceptable solution).
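For anyone trying the replication route, the itemid:host map can be built straight from a vanilla Zabbix MySQL schema (items.hostid joins hosts.hostid). A small sketch, with the driver choice and credentials as placeholders:

```python
import MySQLdb  # the MySQL-python driver, as one example

def load_itemid_map(db_host, user, passwd):
    """Return itemid -> (host, zabbix key) from a vanilla Zabbix schema."""
    conn = MySQLdb.connect(host=db_host, user=user, passwd=passwd, db="zabbix")
    cur = conn.cursor()
    cur.execute(
        "SELECT i.itemid, h.host, i.key_ "
        "FROM items i JOIN hosts h ON i.hostid = h.hostid")
    # Refresh this periodically (or on replication events) to catch new items.
    return {itemid: (host, key) for itemid, host, key in cur.fetchall()}
```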
My team recently made some updates to allow very high-speed ingestion into OpenTSDB. On four machines in a 10-node cluster, we were able to sustain a >100 million data points per second ingest rate. I think that would meet your need for speed. As part of this work, we added a bulk-insert entry point. There isn't a command-line way to hit that entry point yet, but it might serve your needs if there were. We have a PR submitted but haven't seen any action on it yet.
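In the meantime, note that stock OpenTSDB 2.x already accepts batches over HTTP: you can POST a JSON array of datapoints to /api/put. A minimal sketch using the requests library (the TSD URL is a placeholder):

```python
import json
import requests

def bulk_put(datapoints, tsd="http://tsd.example.com:4242"):
    """POST a batch of datapoints to OpenTSDB's /api/put endpoint.

    Each datapoint: {"metric": ..., "timestamp": ..., "value": ..., "tags": {...}}
    The ?details flag makes the TSD report per-datapoint failures.
    """
    resp = requests.post(tsd + "/api/put?details",
                         data=json.dumps(datapoints),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()
```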