High-speed, high-metric-count collectors and tcollector


Paula Keezer

Mar 9, 2013, 1:39:14 PM
to open...@googlegroups.com
Hi All,  A little background: we have been using OpenTSDB for about 1.5 years in a setup that pulled data from Zabbix servers and pushed it into OpenTSDB using import. Typical rates were 500 to 700 metrics per second. We are now being challenged to move upwards of 10,000 metrics per second from Zabbix into OpenTSDB. I have built a collector that takes a feed from Zabbix, maps Zabbix metadata to tags, and produces an OpenTSDB input string. It can process 40K-plus metrics per second. My current challenges regarding tcollector are:

1) Is there a limit to the number of different metrics a tcollector can handle? (Typical collectors handle under 20 individual metrics; my collector handles many thousands of individual metrics.)

2) Since my collector can process 40K-plus metrics per second, how do I connect it to multiple tcollectors?

3) Should I consider bypassing tcollector completely and pushing directly into the TSDs?

Basically the question comes down to: is tcollector capable of handling my throughput, or should I be working on my own high-speed TSD push technology?

Thanks for your thoughts in advance!


Paula Keezer

Mar 10, 2013, 2:01:13 PM
to open...@googlegroups.com
Hi All, it looks like no one has run into this particular scenario, so I will post what I find as I go. Comments from tcollector gurus are welcome!
I started looking at the tcollector code today, and my questions led to the following findings:

First, tcollector is written in Python. As my Zabbix collector is also written in Python, this makes the transition in thought pretty straightforward.
1) Is there a limit to the number of metrics? The simple answer is no. However, the Python dictionary used for de-duping will get large (as large as my Zabbix metadata mapper!). I don't see a serious problem here. In my case, de-duping may not be necessary, as Zabbix has already done much of this work (by reducing the sample rate on non-changing data).
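For illustration, a minimal sketch of the dictionary-based de-dup idea (the names and the suppression window are hypothetical, not tcollector's actual internals):

# Hypothetical sketch of dictionary-based de-duplication: suppress repeats
# of an unchanged value, but re-emit at least once every MAX_SUPPRESS seconds
# so a flat series still shows up. Not tcollector's actual code.
MAX_SUPPRESS = 600  # seconds

last_seen = {}  # (metric, tags) -> (value, last_emitted_timestamp)

def should_emit(metric, tags, value, ts):
    """tags must be hashable, e.g. a tuple of (tagk, tagv) pairs."""
    key = (metric, tags)
    prev = last_seen.get(key)
    if prev is not None:
        prev_value, prev_ts = prev
        if prev_value == value and ts - prev_ts < MAX_SUPPRESS:
            return False  # unchanged value inside the suppression window
    last_seen[key] = (value, ts)
    return True

The dictionary grows with the number of distinct (metric, tags) series, which is the size concern mentioned above.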

2) My collector can process 40K+ samples per second. Can tcollector keep up? I don't think so. Although tcollector has knowledge of multiple TSDs, it seems to establish a single connection and pump data to that connection. I did not see any tests on the output queue to determine whether tcollector is falling behind the incoming data and should establish additional connections. This may be where a modification will be required.

3) Should I consider bypassing tcollector and building my own high-speed push to the TSDs? This is where I think I am heading. For instance, with my Zabbix collector I already ensure clean data, and I can build in de-duping at the cost of two additional tests in my high-speed loop, then send clean, de-duped, TSD-format data. This will have a small impact on my 40K-plus throughput. I also think I can pull the TSD connection code into a separate process that manages multiple TSD connections, monitors the output queue for growth or shrinkage, and increases or reduces the number of TSD connections servicing the queue. The result would be a high-speed, multi-output tcollector.
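As a rough sketch of that queue-monitoring, multi-connection sender (all names and thresholds here are hypothetical; this is not tcollector code):

import queue
import socket
import threading
import time

# Hypothetical sketch: a shared queue of OpenTSDB "put" lines drained by one
# sender thread per TSD, with additional senders started when the queue grows.
TSD_ADDRESSES = [("tsd1.example.com", 4242), ("tsd2.example.com", 4242)]
GROW_THRESHOLD = 10000  # queue depth that triggers another connection

sendq = queue.Queue()

def sender(addr):
    """Drain put lines from the shared queue into a single TSD connection."""
    sock = socket.create_connection(addr)
    try:
        while True:
            line = sendq.get()  # e.g. "put sys.cpu.user 1363000000 42 host=web1\n"
            sock.sendall(line.encode())
    finally:
        sock.close()

def manager():
    """Start one sender; add another TSD connection if the queue backs up."""
    started = 0
    while started < len(TSD_ADDRESSES):
        if started == 0 or sendq.qsize() > GROW_THRESHOLD:
            threading.Thread(target=sender, args=(TSD_ADDRESSES[started],),
                             daemon=True).start()
            started += 1
        time.sleep(1)

A real implementation would also need reconnect logic and a bound on the queue so memory use stays capped.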

Thanks,  P

Jonathan Creasy

Mar 10, 2013, 2:25:28 PM
to Paula Keezer, open...@googlegroups.com

Your conclusions sound correct.

If I may ask, what is the context that has you duplicating Zabbix data into OpenTSDB? This seems like a strange thing to be doing.

Paula Keezer

Mar 10, 2013, 3:17:08 PM
to open...@googlegroups.com, Paula Keezer, jcr...@box.com, j...@box.com
We have an extensive monitoring structure in place with Zabbix, crossing many data centers and tens of thousands of machines. The monitoring structure is 'designed in' to several operations groups, so changing out Zabbix is not an option. Zabbix cannot keep data longer than a couple of weeks, and it does a very poor job at aggregation (across metadata, like tags), causing lots of downstream processing and, of course, no long-term data. We are under a direction to build an archive that can store high-resolution data for at least 1 year, preferably 2 years. We do not want to deploy and manage another set of agents on all the machines (we already have a group and infrastructure that does this). Our shortest and best option right now is to 'intercept' Zabbix data, map Zabbix metadata onto the intercepted data, and push it into a real time-series database. We have been doing this with OpenTSDB for over a year for selected metrics (required by our BI team). Management has now upped the requirements. As I mentioned, we can now accept Zabbix time-series data, map it to Zabbix metadata, and produce an OpenTSDB string at 40K+ per second.

btw, thanks for looking at my conclusions.  I will be proceeding down that path.

P

Andrey Stepachev

Mar 11, 2013, 4:37:22 AM
to Paula Keezer, open...@googlegroups.com
> 3) Should I consider bypassing tcollector and building my own high-speed push to the TSDs?

I have done exactly that. Now I use Flume for data movement and my Flume sink for putting data into OpenTSDB.
Right now TSDB tends to OOM, due to the fact that it doesn't cap its memory usage.

--
Andrey.

ManOLamancha

Mar 12, 2013, 5:12:14 PM
to open...@googlegroups.com, Paula Keezer, jcr...@box.com, j...@box.com
On Sunday, March 10, 2013 3:17:08 PM UTC-4, Paula Keezer wrote:
We have an extensive monitoring structure in place with Zabbix, crossing many data centers and tens of thousands of machines. The monitoring structure is 'designed in' to several operations groups, so changing out Zabbix is not an option. Zabbix cannot keep data longer than a couple of weeks, and it does a very poor job at aggregation (across metadata, like tags), causing lots of downstream processing and, of course, no long-term data. We are under a direction to build an archive that can store high-resolution data for at least 1 year, preferably 2 years. We do not want to deploy and manage another set of agents on all the machines (we already have a group and infrastructure that does this). Our shortest and best option right now is to 'intercept' Zabbix data, map Zabbix metadata onto the intercepted data, and push it into a real time-series database. We have been doing this with OpenTSDB for over a year for selected metrics (required by our BI team). Management has now upped the requirements. As I mentioned, we can now accept Zabbix time-series data, map it to Zabbix metadata, and produce an OpenTSDB string at 40K+ per second.

We have exactly the same situation: with tens of thousands of hosts running through our Zabbix proxies, we can't keep history data, so for now we've deployed collectd to our edge machines. We'll also look into LinkedIn's http://data.linkedin.com/projects/databus to see if we could stream data out of our Zabbix DBs directly into OpenTSDB, but if you could open-source something, I bet a lot of Zabbix users would be interested (including us).

Though I haven't played with tcollector yet, your best bet is to push directly to the TSDs. A single TSD can easily handle 40K+ puts per second. Our dev cluster has 4 HBase nodes on machines with 8GB of RAM, with a TSD running on just one of them with 2GB of heap configured, and we had it pushing 60K per second.
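For anyone going that route, a minimal sketch of writing telnet-style 'put' lines directly to a TSD socket (host, port, and metric names below are placeholders):

import socket
import time

# Minimal sketch: push data points to a TSD using the telnet-style
# "put <metric> <timestamp> <value> <tagk=tagv> ..." line protocol.
TSD_HOST, TSD_PORT = "tsd.example.com", 4242

def push(points):
    """points: iterable of (metric, timestamp, value, tags_dict) tuples."""
    sock = socket.create_connection((TSD_HOST, TSD_PORT))
    try:
        for metric, ts, value, tags in points:
            tagstr = " ".join(f"{k}={v}" for k, v in tags.items())
            sock.sendall(f"put {metric} {ts} {value} {tagstr}\n".encode())
    finally:
        sock.close()

push([("zabbix.item.value", int(time.time()), 42, {"host": "web1"})])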

For long-term storage, just make sure to enable compression in HBase and compactions in the TSDs.

ManOLamancha

Mar 12, 2013, 5:14:45 PM
to open...@googlegroups.com, Paula Keezer
On Monday, March 11, 2013 4:37:22 AM UTC-4, Andrey Stepachev wrote:
> 3) Should I consider bypassing tcollector and building my own high-speed push to the TSDs?

I have done exactly that. Now I use Flume for data movement and my Flume sink for putting data into OpenTSDB.
Right now TSDB tends to OOM, due to the fact that it doesn't cap its memory usage.

Do you know what part of TSD was OOMing out?  I don't think it should run out of heap unless HBase is backed up or splitting regions. You could try a "jmap -histo:live <pid>" to see what's eating up the most space.

Paula Keezer

Mar 13, 2013, 11:25:18 AM
to open...@googlegroups.com
Chris, what kind of hardware are you running where you can get 60K samples per second on a single TSD? I have not done any testing yet. Tsuna's docs indicate a single-core, modest-speed instance can easily handle 2K. I also understand Tsuna has worked on that part extensively since he first wrote his doc. Today, I will be lifting code and setting up the Zabbix collector to pass through a 'SenderThread' class.

We looked at intercepting Zabbix data at the proxies, but the Zabbix itemid is not attached to the data at that point. To get all the Zabbix metadata attached, we need the Zabbix itemid. In my research, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork (from the MIRACLE LINUX folks), you can see exactly where that intercept is.

The challenge with Zabbix metadata is in the mapping recipes. Different Zabbix installations would require different recipe code. I'll look at building a class that can be inherited so developers can build their own recipes for handling the difficult stuff like Zabbix key names, Zabbix key parameters, and Zabbix groups. All require some thought and possibly 'policies' that Zabbix must follow. (Groups were a particularly difficult concept to master until we established a 'naming convention' that the Zabbix folks must follow.)
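Purely as a sketch of that inheritable-recipe idea (hypothetical names, nothing Zabbix-specific beyond the fields mentioned above):

# Hypothetical sketch of an inheritable mapping-recipe base class. Subclasses
# encode one site's conventions for turning Zabbix key names, key parameters,
# and groups into an OpenTSDB metric name plus tags.
class MappingRecipe:
    def metric_name(self, zabbix_key):
        """Default: use the Zabbix key minus its parameters as the metric name."""
        return zabbix_key.split("[", 1)[0]

    def tags(self, host, zabbix_key, groups):
        """Default: tag only by host; override to add group/parameter tags."""
        return {"host": host}

    def to_put_line(self, host, zabbix_key, groups, ts, value):
        tags = self.tags(host, zabbix_key, groups)
        tagstr = " ".join(f"{k}={v}" for k, v in tags.items())
        return f"put {self.metric_name(zabbix_key)} {ts} {value} {tagstr}"

class SiteRecipe(MappingRecipe):
    """Example override enforcing a site naming convention for groups."""
    def tags(self, host, zabbix_key, groups):
        tags = super().tags(host, zabbix_key, groups)
        # Assume group names look like "dc-<datacenter>" per the convention.
        for g in groups:
            if g.startswith("dc-"):
                tags["dc"] = g[3:]
        return tags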

ManOLamancha

Mar 13, 2013, 1:15:21 PM
to open...@googlegroups.com
On Wednesday, March 13, 2013 11:25:18 AM UTC-4, Paula Keezer wrote:
Chris, what kind of hardware are you running where you can get 60K samples per second on a single TSD? I have not done any testing yet. Tsuna's docs indicate a single-core, modest-speed instance can easily handle 2K. I also understand Tsuna has worked on that part extensively since he first wrote his doc. Today, I will be lifting code and setting up the Zabbix collector to pass through a 'SenderThread' class.

The little dev cluster has four machines with:

1 Xeon X5355 2.6GHz, 4 core
Super Micro X7DVL-3
8GB @667MHz
4 146GB SAS 15k disks, one for system, 3 for HDFS as JBOD
Cloudera cdh4.1.2

Some caveats:
  • The UIDs had all been assigned, so there weren't any new time series coming in. UID creation is blocking, so it can slow down a TSD quite a bit.
  • The regions had been split and were load-balanced across the 4 region servers.
  • We were pushing data via HTTP with about 50 data points per request (this will all be 2.0), so we didn't even test the massive speed increase Tsuna and folks implemented (see the sketch after this list).
  • And since the TSD ran on a box along with a region server (not recommended for max throughput), the CPU was pretty close to being maxed out.
  • The nameserver and ZooKeeper run on node #1 in our cluster; the TSD runs on #4.
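A minimal sketch of the batched HTTP puts mentioned in the third caveat, against the 2.0 /api/put endpoint (host and metric names are placeholders):

import json
import time
import urllib.request

# Minimal sketch: send a batch of data points to OpenTSDB 2.0's /api/put
# HTTP endpoint (roughly 50 per request, as described above).
TSD_URL = "http://tsd.example.com:4242/api/put"

def post_batch(points):
    """points: list of dicts with metric, timestamp, value, and tags keys."""
    req = urllib.request.Request(
        TSD_URL,
        data=json.dumps(points).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

post_batch([{"metric": "zabbix.item.value",
             "timestamp": int(time.time()),
             "value": 42,
             "tags": {"host": "web1"}}])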
Our prod cluster has 8 region server nodes:

2x E5-2640 2.5GHz, 6 core
Super Micro X9DRT-HF+
64GB @1033MHz
12 1TB SAS 7200 disks, 2 mirrored for system, 10 for HDFS as JBOD
Cloudera cdh4.1.2

with separate boxes for the TSD front-end, nameservers and zookeeper.

We looked at intercepting Zabbix data at the proxies, but the Zabbix itemid is not attached to the data at that point. To get all the Zabbix metadata attached, we need the Zabbix itemid. In my research, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork (from the MIRACLE LINUX folks), you can see exactly where that intercept is.

The challenge with Zabbix metadata is in the mapping recipes. Different Zabbix installations would require different recipe code. I'll look at building a class that can be inherited so developers can build their own recipes for handling the difficult stuff like Zabbix key names, Zabbix key parameters, and Zabbix groups. All require some thought and possibly 'policies' that Zabbix must follow. (Groups were a particularly difficult concept to master until we established a 'naming convention' that the Zabbix folks must follow.)

Sounds really cool.

Paula Keezer

Mar 16, 2013, 2:50:15 PM
to open...@googlegroups.com
Hello TSD gurus! I am rapidly approaching the single high-speed collector, multiple-TSD model. In this model, I have a single collector sinking 40K-plus samples/second across 2K different metrics. I have plenty of resources to start up multiple TSDs. SenderThreads are created for each TSD (let's say two for now). As I sink my 40K named pipe, I assemble/move my 'put' statements into a SenderThread sendq. When I reach 4K lines, I unlock that SenderThread and start filling up my second SenderThread's sendq. Both sendqs can have the same metrics but different timestamp/value lines.

Here is the potential problem: sendq1 gets filled first, its sender thread is unlocked, and it is sending to TSD1. sendq2 fills up (partially, because we are fast enough) and its sender thread sends to TSD2 while TSD1 is still sinking 'puts'. The concerning condition is:
1) there is a sample of metricA at time=0 at the end of sendq1, and TSD1 has not sunk it yet;
2) there is a sample of metricA at time+1 at the beginning of sendq2, and TSD2 has already processed it (a sample out of sequence).

Will TSD1 throw an 'out of sequence' error when it reaches metricA at time=0, or will TSD1 simply put it in the correct row position?

I ask this question because the OpenTSDB 'import' command will complain if provided with an out-of-sequence (or equal) timestamp for the same metric/tags combination during a single import. Multiple imports that overwrite or 'fill in' seem to work fine.

I will likely find out soon enough, but any knowledge that could be shared might save me some design cycles :-)

Paula




tsuna

Mar 22, 2013, 5:14:18 PM
to Paula Keezer, open...@googlegroups.com
On Sat, Mar 16, 2013 at 11:50 AM, Paula Keezer <paula....@gmail.com> wrote:
> Will TSD1 throw an 'out of sequence' error when it reaches metricA at time=0,
> or will TSD1 simply put it in the correct row position?

No, it won't complain, and it will do the right thing. But this is
suboptimal because now both your TSDs will have written data to the
same row, and they will both attempt to recompact that row. The
double recompaction is harmless, but it increases the amount of work
that TSD and HBase have to do.

A better strategy may be to distribute the data across your two
queues based on some hash function. Hash the metric name and set
of tags, and then apply a "modulo N" on the hash, where N is the
number of queues you have. This way the same time series will always
wind up in the same queue, until there is a failure of one of the
queues, in which case you can either keep the queue as-is and attempt
to reconnect or simply flush the queue to another TSD by opening a new
connection.
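A minimal sketch of that hashing scheme (the hash choice and tag ordering here are just one reasonable option):

import hashlib

# Route a data point to a queue index so the same metric+tags combination
# always lands on the same queue/TSD.
def pick_queue(metric, tags, num_queues):
    """tags: dict of tagk -> tagv; returns an index in [0, num_queues)."""
    key = metric + "".join(f" {k}={tags[k]}" for k in sorted(tags))
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_queues

# Example: both samples of the same series pick the same queue.
q = pick_queue("sys.cpu.user", {"host": "web1", "cpu": "0"}, 2)

A stable hash (rather than Python's built-in hash(), which is salted per process) keeps the routing consistent across collector restarts.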

--
Benoit "tsuna" Sigoure

Chris Christensen

Aug 15, 2014, 2:58:39 PM
to open...@googlegroups.com, paula....@gmail.com
(Apologies for opening an old thread but this seems topical)

(Same env. details / setup ManOLamancha mentioned)


1) Is there a limit to the number of metrics? The simple answer is no. However, the Python dictionary used for de-duping will get large (as large as my Zabbix metadata mapper!). I don't see a serious problem here. In my case, de-duping may not be necessary, as Zabbix has already done much of this work (by reducing the sample rate on non-changing data).

I used the tcollector parameters to basically disable dedup (because, as you mentioned, it doesn't seem particularly necessary for data coming through Zabbix):
./tcollector.py --dedup-interval=0 --evict-interval=1 -c ./collectors ...

2) My collector can process 40K+ samples per second. Can tcollector keep up? I don't think so. Although tcollector has knowledge of multiple TSDs, it seems to establish a single connection and pump data to that connection. I did not see any tests on the output queue to determine whether tcollector is falling behind the incoming data and should establish additional connections. This may be where a modification will be required.
 
With a current rate of ~4-5K values per second (looking historically, we have bursts up to ~40K; curious if there are any recommended methods for testing how that will hold up without having to go through both systems?),
keeping an eye on tcollector.* metrics (tcollector's internal self-reporting; specifically, lines_dropped looks like it could be a signal), everything looks OK (...still testing with the full dataset/load).

In my research, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork (from the MIRACLE LINUX folks), you can see exactly where that intercept is.

The method referenced in the ^ PR uses MySQL replication to connect (vs. a patch on the Zabbix server process); additionally, the in-memory map (itemid:host) seems to be generally applicable to vanilla Zabbix installs. However, miraclelinux/HistoryGluon:MIRACLE-ZBX-2.0.3-NoSQL certainly looks like it provides not only the entry point for extending the Zabbix server but also a nice path forward to potentially using the full-resolution metrics in TSDB for display on the frontend.


This strategy looks like it could be a nice method to shed some responsibility off Zabbix (where it might be entrenched or a perfectly acceptable solution).


ManOLamancha

Aug 19, 2014, 7:25:12 PM
to open...@googlegroups.com, paula....@gmail.com
On Friday, August 15, 2014 11:58:39 AM UTC-7, Chris Christensen wrote: 
With a current rate of ~4-5K values per second (looking historically, we have bursts up to ~40K; curious if there are any recommended methods for testing how that will hold up without having to go through both systems?),
keeping an eye on tcollector.* metrics (tcollector's internal self-reporting; specifically, lines_dropped looks like it could be a signal), everything looks OK (...still testing with the full dataset/load).

Long time no see ;) You could point tcollector at a tsddrain.py instance to avoid the TSD portion, and that should give you a good idea of how much throughput tcollector can handle. The TSDs won't have a problem with 40K points per second.

In my research, I found a good intercept point in the dbcache.c Zabbix code (in the lib directory, I believe). If you take a look at the HistoryGluon fork (from the MIRACLE LINUX folks), you can see exactly where that intercept is.

The method referenced in the ^ PR uses MySQL replication to connect (vs. a patch on the Zabbix server process); additionally, the in-memory map (itemid:host) seems to be generally applicable to vanilla Zabbix installs. However, miraclelinux/HistoryGluon:MIRACLE-ZBX-2.0.3-NoSQL certainly looks like it provides not only the entry point for extending the Zabbix server but also a nice path forward to potentially using the full-resolution metrics in TSDB for display on the frontend.


This strategy looks like it could be a nice method to shed some responsibility off Zabbix (where it might be entrenched or a perfectly acceptable solution).

Looks good to me. I don't work on the tcollector stuff, but I could ping Vasilly to see if he could take a look.

Paula Keezer

Sep 1, 2014, 11:58:09 AM
to ManOLamancha, open...@googlegroups.com
Just a note: we have been running that patch in Zabbix as the intercept point for over a year, now collecting from 4 Zabbix servers, each delivering 20K to 40K metrics per second, about 4B metrics per day. The history data at that point does, in fact, have sub-second resolution; we are not using that feature yet. Final note: no metadata from Zabbix is available at that intercept. Zabbix metadata (keyed by the itemid) must be gathered separately and combined prior to sending into the TSD.

Ted Dunning

Sep 1, 2014, 6:50:52 PM
to open...@googlegroups.com, clars...@gmail.com
My team has recently made some updates to allow very high-speed ingestion into OpenTSDB. On four machines in a 10-node cluster we were able to sustain a >100 million data points per second ingest rate.

I think that would meet your need for speed.

As part of this work, we added a bulk insert entry point. There isn't a command-line way to hit that entry point yet, but it might serve your needs if there were.

We have submitted a PR but haven't seen any action on it yet.

Mathias Herberts

Sep 2, 2014, 3:32:15 AM
to open...@googlegroups.com, clars...@gmail.com


On Tuesday, September 2, 2014 12:50:52 AM UTC+2, Ted Dunning wrote:
My team has recently made some updates to allow very high-speed ingestion into OpenTSDB. On four machines in a 10-node cluster we were able to sustain a >100 million data points per second ingest rate.

I think that would meet your need for speed.

As part of this work, we added a bulk insert entry point. There isn't a command-line way to hit that entry point yet, but it might serve your needs if there were.

We have submitted a PR but haven't seen any action on it yet.


Hi Ted,

Which PR is that?

Ted Dunning

Sep 2, 2014, 8:51:10 PM
to open...@googlegroups.com, clars...@gmail.com

ManOLamancha

Sep 2, 2014, 9:44:30 PM
to open...@googlegroups.com, clars...@gmail.com
On Monday, September 1, 2014 3:50:52 PM UTC-7, Ted Dunning wrote:
My team has recently made some updates to allow very high-speed ingestion into OpenTSDB. On four machines in a 10-node cluster we were able to sustain a >100 million data points per second ingest rate.

I think that would meet your need for speed.

As part of this work, we added a bulk insert entry point. There isn't a command-line way to hit that entry point yet, but it might serve your needs if there were.

We have submitted a PR but haven't seen any action on it yet.

Damn, 100Mps? Is that a consistent rate or is it just a computation off the bulk-import time? I'll definitely get to your PR soon. Thanks!

Ted Dunning

Sep 2, 2014, 11:55:22 PM
to ManOLamancha, open...@googlegroups.com
We are targeting 1M points/s for steady state. That implies 100x that rate for bulk loading of the backlog test data; without the acceleration, you can't test at scale (it must take << 2 years to load 2 years of data).

The code path for the bulk ingest writes directly into HBase or MapR DB using OpenTSDB code. For sustained high performance, we put a memory buffer into the TSD so that all TSDB compaction happens in memory. Data queries are satisfied from both the database and the memory buffer.

At that point, the primary limitation is the REST API.  We have added a bulk API to help with that, but further changes are going to be necessary as well.

This also changes the architecture of the system so that the TSDs are no longer stateless, and restarts may lose data. Our plan for that is to record a replay log into a persistent log file. On restart, the memory image can be reloaded from the log. Our tentative plan for having multiple TSDs is to have each write a separate log which all the other TSDs read. This gives us some nice guarantees (subject to reduced consistency due to delay) but limits scalability.
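A toy sketch of the buffer-plus-replay-log idea described above (entirely hypothetical names; this is not the actual implementation):

import json
import os

# Toy sketch: an in-memory write buffer backed by an append-only replay log.
# On restart the buffer is rebuilt by replaying the log; a real system would
# also have to handle points already flushed to storage before a crash.
class BufferedWriter:
    def __init__(self, log_path):
        self.log_path = log_path
        self.buffer = []  # points held in memory, not yet flushed to storage
        if os.path.exists(log_path):
            with open(log_path) as f:  # replay on restart
                self.buffer = [json.loads(line) for line in f]
        self.log = open(log_path, "a")

    def write(self, point):
        """point: dict with metric, timestamp, value, tags."""
        self.log.write(json.dumps(point) + "\n")
        self.log.flush()  # persist before acknowledging the write
        self.buffer.append(point)

    def flush_to_store(self, store_writer):
        """Drain the buffer into durable storage, then truncate the log."""
        for point in self.buffer:
            store_writer(point)
        self.buffer.clear()
        self.log.truncate(0)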

Our thesis is that these limits will be acceptable in the near term and are a viable trade-off in return for a 100x or more performance increase over vanilla operation. We have customers recording a bit over 1M samples/sec, but none above 10M samples/s.



