Single node corrupt HFile problems

Izak Marais

Apr 14, 2016, 9:41:09 AM
to OpenTSDB
Hi All,

We run a single-node OpenTSDB with HBase writing to a local, RAID-backed filesystem instead of HDFS (as recommended for our scale). OpenTSDB easily handles the ingestion rate (about 7,000 data points per second).

However, we have had repeated file-level corruption problems. Over the last few months, our two test systems have five times had an HBase 'tsdb' region stuck in a FAILED_OPEN state. The only way I could recover from this was to delete the offending region file from disk.
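
Roughly what that recovery looks like (a sketch: the region hash and store-file name below are placeholders, and a real path appears in the log excerpt further down; moving the file aside instead of deleting it keeps it around for later inspection):

# Move the corrupt store file out of the region's column-family directory:
mv /data/hbase/hbase/data/default/tsdb/<region-hash>/t/<corrupt-hfile> /var/tmp/

# Then let hbck fix the assignment so HBase tries to re-open the region:
hbase hbck -fixAssignments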

Is there something we can improve in our setup to avoid these errors? I am thinking about moving to HDFS. Is it possible/worthwhile to run a single-node HDFS (with multiple JBOD disks for reliability)?

Any advice will be greatly appreciated!

Thanks
Izak

More details about diagnosing the problem, which is possible from the HBase web GUI (the FAILED_OPEN state is visible there):

The HBase logs will contain errors like:

2016-04-14 11:50:34,495 ERROR [RS_OPEN_REGION-sea-badger:60020-2] handler.OpenRegionHandler: Failed open of region=tsdb,\x00\x07\xD8V\xF7\xAF \x00\x00\x01\x00\x03\x05\x00\x00\x02\x00\x02\xF3\x00\x00\x04\x00\x00\x10\x00\x00\x0C\x00\x03\x08\x00\x00&\x00\x00\x8A\x00\x00'\x00\x00\x8A\x00\x00(\x00\x03\x09\x00\x00=\x00\x00\xBC,1459245586846.909c9397f7105f3141ce8a5dcea6b8c4., starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file file:/data/hbase/hbase/data/default/tsdb/909c9397f7105f3141ce8a5dcea6b8c4/t/746cdacd07844815af8a46e1bf9dd19a
        at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:832)
        ...
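
The corrupt file can also be inspected directly with the HFile tool that ships with HBase (shown here with the path from the stack trace above; on a damaged file it fails with the same CorruptHFileException):

# Print the HFile's metadata (-m); -f takes the store file path:
hbase hfile -m -f file:/data/hbase/hbase/data/default/tsdb/909c9397f7105f3141ce8a5dcea6b8c4/t/746cdacd07844815af8a46e1bf9dd19a

# hbck prints a cluster-wide consistency report, including regions stuck
# in transition:
hbase hbck -details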

John A. Tamplin

Apr 14, 2016, 9:52:32 AM
to Izak Marais, OpenTSDB
On Thu, Apr 14, 2016 at 9:41 AM, 'Izak Marais' via OpenTSDB <open...@googlegroups.com> wrote:
We run a single-node OpenTSDB with HBase writing to a local, RAID-backed filesystem instead of HDFS (as recommended for our scale). OpenTSDB easily handles the ingestion rate (about 7,000 data points per second).

However, we have had repeated file-level corruption problems. Over the last few months, our two test systems have five times had an HBase 'tsdb' region stuck in a FAILED_OPEN state. The only way I could recover from this was to delete the offending region file from disk.

Is there something we can improve in our setup to avoid these errors? I am thinking about moving to HDFS. Is it possible/worthwhile to run a single-node HDFS (with multiple JBOD disks for reliability)?

I think running HBase on a regular filesystem is not expected to be reliable: even beyond hardware errors, if the software crashes there is no protection against corruption. If you are putting real data in it, you need to be using HDFS.
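
A single-node setup would look roughly like this (a sketch, assuming Hadoop 2.x with HADOOP_CONF_DIR set; the /data1../data3 JBOD mount points are made up). One caveat: HDFS does not replicate blocks across the disks of a single node, so with dfs.replication=1 a dead disk still loses whatever blocks were on it. What you gain over a plain local filesystem is the durable write path (working hflush/hsync for the HBase WAL), which is where the crash protection comes from.

# Minimal hdfs-site.xml for a single-node, multi-disk (JBOD) setup:
cat > "$HADOOP_CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <!-- One node, so only one replica is possible -->
  <property><name>dfs.replication</name><value>1</value></property>
  <!-- Spread blocks across the JBOD mounts -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data1/dfs,/data2/dfs,/data3/dfs</value>
  </property>
  <!-- Keep the datanode up if a single disk dies -->
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>
</configuration>
EOF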

--
John A. Tamplin

Izak Marais

Apr 15, 2016, 1:30:00 AM
to OpenTSDB, izakm...@yahoo.com
Thanks for confirming my suspicion. 

Jonathan Creasy

Apr 15, 2016, 8:45:48 AM
to Izak Marais, OpenTSDB

Feel free to open an issue, something like

"OpenTSDB should reliably operate on a single node"

It is something we discussed while looking at roadmap items and should be supportable once we have abstracted the storage layer better.

Izak Marais

May 19, 2016, 1:21:34 AM
to OpenTSDB, izakm...@yahoo.com