Hi,
OpenTSDB v2.0 is running on top of Cloudera 5.3.1 in AWS. We have a 7-node Cloudera cluster (each node with 32 GB RAM and 3 TB of disk), with 5 OpenTSDB instances dedicated to writing and 2 to reading, and AWS ELBs in front of OpenTSDB to balance the reads/writes. After ingesting 200-300 MB of data, HBase tries to compact the table and ends up creating a .tmp file that grows to fill the entire HDFS space, and the cluster eventually dies. I tried removing the .tmp file and restarting HBase, but it goes right back to creating the same gigantic .tmp file and dies again.
Here are the stack traces of the region server dumps when the huge .tmp files are created:
https://drive.google.com/open?id=0B1tQg4D17jKQNDdFZkFQTlg4ZjQ&authuser=0
For background: we are not using compression, compaction occurs every hour, and everything else is at the defaults.
We are load testing OpenTSDB over raw sockets (the telnet-style interface) but are running into several issues. First, let me explain how we do this load testing:
1. We wrote a testing framework, running on a separate AWS system, to generate load.
2. The framework takes several parameters: the number of threads, the loop size (i.e. the number of sockets each thread will open), and the batch size (i.e. the number of `put` commands, or inserts, that each socket connection will send).
3. To simplify troubleshooting, we removed variables from the tests: there is just 1 OpenTSDB instance behind the AWS ELB, so the load is sent to 1 instance only.
4. We initially create the OpenTSDB tables without any pre-splitting of regions.
5. We load with 1 metric only, for ease of querying in the UI.
6. We are sending under 5,000 inserts per second.
7. At the top of the hour, row compaction kicks in and the region server is too busy, so we lose data. It recovers the first time, but in the 2nd hour there is presumably so much data that it doesn't recover. To fix it, we have to restart Cloudera, reboot the nodes, and drop and re-create the tsdb tables. Otherwise the .tmp file keeps growing until it fills the 3 TB disks and the system becomes unresponsive.
8. We see problems with region splits happening under heavy load. We noted a code fix committed on Jan 11 for this, but I presume that is not in RC 2.1.
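The harness described in the steps above is roughly the following (a minimal sketch only, assuming the telnet-style `put` interface on port 4242; the function names, hostname, and parameter values here are illustrative, not our actual framework code):

```python
import socket
import threading
import time

def make_put_line(metric, ts, value, tags):
    """Build one telnet-style put command: put <metric> <ts> <value> <tag=v ...>"""
    tag_str = " ".join("%s=%s" % kv for kv in sorted(tags.items()))
    return "put %s %d %s %s\n" % (metric, ts, value, tag_str)

def worker(host, port, loop_size, batch_size, metric):
    # Each thread opens `loop_size` sockets; each socket sends `batch_size` puts.
    for _ in range(loop_size):
        with socket.create_connection((host, port)) as s:
            now = int(time.time())
            batch = "".join(
                make_put_line(metric, now + i, i, {"host": "loadgen"})
                for i in range(batch_size)
            )
            s.sendall(batch.encode("ascii"))

def run(host, port, threads, loop_size, batch_size, metric="sys.load.test"):
    pool = [threading.Thread(target=worker,
                             args=(host, port, loop_size, batch_size, metric))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()

if __name__ == "__main__":
    # e.g. 10 threads x 5 sockets x 100 puts per socket (hypothetical hostname)
    run("tsdb-writer.example.com", 4242, threads=10, loop_size=5, batch_size=100)
```

Note that this sketch, like our real framework, never reads from the socket after writing, which is relevant to the data-loss question below.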
This is cross-posted to the HBase forum here: http://apache-hbase.679495.n3.nabble.com/template/NamlServlet.jtp?macro=edit_post&node=4068627
Also, the docs at http://opentsdb.net/docs/build/html/user_guide/writing.html#telnet need updating to tell writers that they need to read data back from the socket, because right now there's no way to know that. I think we've been seeing cases where writes appear to succeed but metrics are lost, probably because the workers OOM on their outbound write buffers. What is the format of the errors written back?
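Since the telnet interface only writes text back on the connection when something goes wrong, one way to surface those losses would be to drain the socket after each batch. A rough sketch of what I mean (assumptions: Python, `drain_errors` is my own name, and the exact error-line format is exactly the open question above):

```python
import select
import socket

def drain_errors(sock, timeout=0.5):
    """Read back any lines the TSD has written on this connection.

    Successful telnet-style puts are not acknowledged, so anything we can
    read here should be an error report. (The precise message format is
    the question posed above.)
    """
    buf = b""
    while True:
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            break  # nothing (more) to read within the timeout
        chunk = sock.recv(4096)
        if not chunk:
            break  # server closed the connection
        buf += chunk
    return [line for line in buf.decode("ascii", "replace").splitlines() if line]
```

A writer would call `drain_errors(s)` after `sendall` on each batch and log or retry on any returned lines, instead of assuming the batch landed.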