HBase with OpenTSDB creates huge .tmp file & runs out of HDFS space

orange200

Feb 25, 2015, 8:55:50 AM
to open...@googlegroups.com
Hi, 

OpenTSDB v2.0 is running on top of Cloudera 5.3.1 in AWS. We have a 7-node Cloudera cluster (each node with 32 GB RAM and 3 TB of disk space), with 5 OpenTSDB instances dedicated to writing and 2 to reading. We are using AWS ELBs in front of OpenTSDB to balance the reads/writes. 

After ingesting 200-300 MB of data, HBase tries to compact the table and ends up creating a .tmp file that grows to fill the entire HDFS space, and eventually dies. I tried removing this .tmp file and restarting HBase, but it goes back to creating the same gigantic .tmp file and ends up dying again. 
  
Here are the stack traces of the region server dumps when the huge .tmp files are created: 

https://drive.google.com/open?id=0B1tQg4D17jKQNDdFZkFQTlg4ZjQ&authuser=0

As background, we are not using compression. Compaction occurs every hour. Everything else is default. 

We are load testing OpenTSDB using sockets (the telnet-style interface), but we are running into several issues. Let me explain first how we do this load testing (a rough sketch of such a client follows the list below): 

1. From another AWS system, we have written a testing framework to generate load. 

2. The framework takes several parameters: we can specify the number of threads, the loop size (i.e. the number of sockets each thread will open), and the batch size (i.e. the number of puts, or inserts, each socket connection will handle). 

3. To simplify troubleshooting, we removed variables from the tests: we have just 1 OpenTSDB instance behind the AWS ELB, so the load is being sent to one instance only. 

4. We are initially creating the OpenTSDB tables without any pre-splitting of regions. 

5. We are doing the loading with 1 metric only for ease of querying in the UI. 

6. We are sending under 5,000 inserts per second. 

7. At the top of the hour, the row compaction kicks in, the region server is too busy, and we lose data. It recovers the first time, but by the second hour there is presumably so much data that it doesn't recover. To fix it, we have to restart Cloudera, reboot the nodes, drop the tsdb tables, and re-create them. Otherwise the .tmp file keeps growing until it fills the 3 TB disks and the system becomes unresponsive. 

8. We see problems with region splits happening under heavy load. We noted a code fix committed on Jan 11 for this, but I presume that is not in RC2.1. 
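
As an illustration of the setup described above, here is a rough sketch of this kind of socket load generator. Everything in it is hypothetical (host, port, metric name, parameter values); it only assumes OpenTSDB's telnet-style "put" line format, a single metric, and the thread/loop/batch structure from points 1-6.

    # Hypothetical sketch: N threads, each opening loop_size sockets,
    # each socket sending batch_size telnet-style "put" lines.
    import socket
    import threading
    import time

    TSD_HOST, TSD_PORT = "tsd.example.com", 4242   # example endpoint behind the ELB
    METRIC = "loadtest.metric"                     # a single test metric

    def worker(thread_id, loop_size, batch_size):
        for _ in range(loop_size):                 # sockets opened per thread
            with socket.create_connection((TSD_HOST, TSD_PORT)) as sock:
                for i in range(batch_size):        # puts sent per socket
                    ts = int(time.time() * 1000)   # millisecond timestamp
                    line = "put %s %d %d host=load%d\n" % (METRIC, ts, i, thread_id)
                    sock.sendall(line.encode())

    threads = [threading.Thread(target=worker, args=(t, 10, 1000)) for t in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()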

wangja...@gmail.com

Apr 2, 2015, 3:13:34 AM
to open...@googlegroups.com

Hello orange200,
What is the root cause? I have run into the same problem.
On Wednesday, February 25, 2015 at 9:55:50 PM UTC+8, orange200 wrote:

ManOLamancha

Apr 3, 2015, 8:22:27 PM
to open...@googlegroups.com
On Wednesday, February 25, 2015 at 5:55:50 AM UTC-8, orange200 wrote:
Hi, 

OpenTSDB v2.0 is running on top of Cloudera 5.3.1 in AWS. We have a 7-node Cloudera cluster (each node with 32 GB RAM and 3 TB of disk space), with 5 OpenTSDB instances dedicated to writing and 2 to reading. We are using AWS ELBs in front of OpenTSDB to balance the reads/writes. 

After ingesting 200-300 MB of data, HBase tries to compact the table and ends up creating a .tmp file that grows to fill the entire HDFS space, and eventually dies. I tried removing this .tmp file and restarting HBase, but it goes back to creating the same gigantic .tmp file and ends up dying again. 

That sounds like a bug in Cloudera's code or HBase itself so you may want to check out those mailing lists. TSDB shouldn't be able to cause that. I'll see if I can fire up an instance this weekend though. 

  
Here are the stack traces of the region server dumps when the huge .tmp files are created: 

https://drive.google.com/open?id=0B1tQg4D17jKQNDdFZkFQTlg4ZjQ&authuser=0

As background, we are not using compression. Compaction occurs every hour. Everything else is default. 

We are load testing OpenTSDB using sockets (the telnet-style interface), but we are running into several issues. Let me explain first how we do this load testing: 

1. From another AWS system, we have written a testing framework to generate load. 

2. The framework takes several parameters: we can specify the number of threads, the loop size (i.e. the number of sockets each thread will open), and the batch size (i.e. the number of puts, or inserts, each socket connection will handle). 

Make sure you're consuming from the socket too; TSDs can OOM if their write buffer fills up with exceptions being written back to the socket.
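
For example, a minimal sketch of draining the socket while writing, so any error text the TSD writes back is read instead of piling up (function and variable names here are hypothetical):

    # Sketch: after each write, poll the socket non-blockingly for error text.
    import select

    def send_batch(sock, lines):
        for line in lines:
            sock.sendall(line.encode())
            readable, _, _ = select.select([sock], [], [], 0)  # timeout 0 = don't block
            if readable:
                err = sock.recv(4096).decode(errors="replace")
                if err:
                    print("TSD wrote back:", err.strip())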
 


3. To simplify troubleshooting, we removed variables from the tests: we have just 1 OpenTSDB instance behind the AWS ELB, so the load is being sent to one instance only. 

4. We are initially creating the OpenTSDB tables without any pre-splitting of regions. 

5. We are doing the loading with 1 metric only for ease of querying in the UI. 

6. We are sending under 5,000 inserts per second. 

7. At the top of the hour, the row compaction kicks in, the region server is too busy, and we lose data. It recovers the first time, but by the second hour there is presumably so much data that it doesn't recover. To fix it, we have to restart Cloudera, reboot the nodes, drop the tsdb tables, and re-create them. Otherwise the .tmp file keeps growing until it fills the 3 TB disks and the system becomes unresponsive. 

For now, just disable the compactions in OpenTSDB. A number of folks have run into this issue where the reads, writes, and deletes put a fair amount of load on their HBase instances. We've had some success with appends, but they eat up CPU on the HBase servers. So it's a trade-off between space, CPU, and network. We may need a coprocessor to handle compactions in a more efficient manner. 
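
For reference, a sketch of the relevant opentsdb.conf settings (verify against the documentation for your OpenTSDB version; the appends option only exists in 2.2+):

    # Stop the TSD from re-writing rows into compacted columns:
    tsd.storage.enable_compaction = false
    # 2.2+ only: write data points as appends instead, trading HBase CPU
    # for the compaction read/write/delete load:
    # tsd.storage.enable_appends = true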


8. We see problems with region splits happening under heavy load. We noted a code fix committed on Jan 11 for this, but I presume that is not in RC2.1. 

What's the code fix you saw? 
Thank you! 

伍照坤

Apr 11, 2015, 9:43:18 AM
to open...@googlegroups.com
We found the root cause.
1. We use millisecond timestamps in our OpenTSDB data points, which can generate over 32,000 columns per row in one hour. Each millisecond-precision column uses a 4-byte qualifier, so when OpenTSDB compacts the row, the combined column qualifier can exceed 128 KB (hfile.index.block.max.size).
2. If the size of (rowkey + columnfamily:qualifier) > hfile.index.block.max.size, the memstore flush can go into an infinite loop while writing the HFile index.

That's why the compaction hangs, the .tmp folder of the region on HDFS grows without bound, and the region server eventually goes down.
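
Rough arithmetic behind that, assuming roughly 4 bytes of qualifier per millisecond-precision data point as described above:

    32,000+ columns/row/hour x 4 bytes/qualifier ≈ 128,000+ bytes of compacted qualifier
    hfile.index.block.max.size (HBase default)    = 131,072 bytes (128 KB)

So once the compacted row's key (rowkey + column family:qualifier) passes that limit, the flush hits the infinite loop described in point 2.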

On Wednesday, February 25, 2015 at 5:55:50 AM UTC-8, orange200 wrote:

Nick Dimiduk

Apr 14, 2015, 12:42:26 AM
to 伍照坤, open...@googlegroups.com
That's good info. Can you bring it up over on the HBase dev list? Maybe express it as a unit test? We'll get it fixed pronto!

Thanks,
Nick

Jeremy Truelove

Apr 14, 2015, 1:03:50 PM
to open...@googlegroups.com, tony...@gmail.com
Also, these docs need updating (http://opentsdb.net/docs/build/html/user_guide/writing.html#telnet) to tell writers that they need to read back data, because right now there's no way to know that. I think we've been seeing issues where writes appear to succeed but we are losing metrics, probably due to the workers OOMing on their outbound write buffers. What is the format of the errors written back?

ManOLamancha

Apr 23, 2015, 9:41:40 PM
to open...@googlegroups.com, tony...@gmail.com
On Tuesday, April 14, 2015 at 10:03:50 AM UTC-7, Jeremy Truelove wrote:
Also, these docs need updating (http://opentsdb.net/docs/build/html/user_guide/writing.html#telnet) to tell writers that they need to read back data, because right now there's no way to know that. I think we've been seeing issues where writes appear to succeed but we are losing metrics, probably due to the workers OOMing on their outbound write buffers. What is the format of the errors written back?

The format is just the exception raised by HBase or by illegal data. Unfortunately, with telnet you have no way to associate the error with the data point. Instead, it's better to use the HTTP interface, where you can determine exactly which data point triggered an error. Or in 2.2 we'll have the error-handling code that will let plugins requeue or temporarily spool data to disk. 
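
For illustration, a minimal example of the HTTP route (the host name is made up; the standard /api/put endpoint with the "details" query parameter returns a summary with any per-data-point errors):

    # Post a data point over HTTP and read back the success/failed/errors summary.
    import json
    import urllib.request

    datapoints = [{"metric": "sys.cpu.user", "timestamp": 1424876150,
                   "value": 42, "tags": {"host": "web01"}}]
    req = urllib.request.Request(
        "http://tsd.example.com:4242/api/put?details",
        data=json.dumps(datapoints).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())  # e.g. {"errors":[],"failed":0,"success":1}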

Larry Reeder

Jun 10, 2015, 11:27:26 PM
to open...@googlegroups.com
Is there a workaround for this that doesn't involve clearing all my existing data?  When I start HBase, it creates a very large temp file and eventually uses all the disk.  I'm assuming it's the same cause, although OpenTSDB is not running at the time.  I'm running HBase on a single node, and while I don't mind losing some data, I'd hate to delete it all.

Thanks......           Larry

Nick Dimiduk

Jun 17, 2015, 8:34:10 PM
to Larry Reeder, opentsdb
For anyone interested in tracking this bug, there's an open ticket over on HBASE-13329. I spent a bit of time with it this week but was unable to reproduce the exact condition that triggers it. I think there's more jiggering to be done in how the test creates rowkeys. I posted my WIP patch if anyone is curious.

Larry Reeder

Jun 22, 2015, 11:51:55 AM
to open...@googlegroups.com
For any others trying to recover without wiping all data, this worked for me.  Caveat: I have no idea what I'm doing and your results may vary.

1.  When HBase is starting, it will create that huge file in a .tmp directory in one of the subdirectories under the tsdb directory.
2.  Shut down HBase.
3.  Find the parent directory for the .tmp directory containing the huge file.
4.  If the parent directory contains a directory called recovered.edits, delete the recovered.edits directory or rename it to something like recovered.edits.bak.
5.  Remove the huge temp file to recover your disk.
6.  Start HBase
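
Roughly, steps 3-5 might look like this on the command line (the region path is an illustrative placeholder and varies by HBase version and layout; locate the actual parent of the .tmp directory first):

    # <region> is whichever region directory under the tsdb table holds the huge .tmp file
    hdfs dfs -mv /hbase/data/default/tsdb/<region>/recovered.edits \
                 /hbase/data/default/tsdb/<region>/recovered.edits.bak
    hdfs dfs -rm -r -skipTrash /hbase/data/default/tsdb/<region>/.tmp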

Looks like I lost about a day's worth of data doing this.  Considering I have 18 months' worth of data I wanted to keep, I'm happy with the outcome, but as I said, your results may vary.

      -Larry