Broken HBase + OpenTSDB (help needed!)


Toni Moreno

Mar 27, 2012, 3:09:29 AM
to open...@googlegroups.com
Hi guys.

I have a working HBase 0.92.0 and OpenTSDB 1.1.0 installation, and I've been collecting about 500 metrics per minute for the past week. Suddenly, in the middle of the week, my HBase and TSD processes seemed frozen, so I restarted them all by killing the processes.

Tonight the problem happened again, and now I'm not able to restart them. When I try to restart, I get the errors below. How can I fix this?

Thanks a lot!!!

The TSD log shows:

2012-03-27 08:51:28,626 INFO  [main-SendThread(localhost:2181)] ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
2012-03-27 08:51:28,678 INFO  [main-SendThread(localhost:2181)] ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x13652e513830005, negotiated timeout = 5000
2012-03-27 08:51:28,706 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!

2012-03-27 08:51:29,715 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!
2012-03-27 08:51:30,733 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!
2012-03-27 08:51:31,756 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!
2012-03-27 08:51:32,775 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!
2012-03-27 08:51:33,793 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!
2012-03-27 08:51:34,815 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!



And the HBase log shows:

2012-03-27 08:51:07,544 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=file:/opt/hbase/data/splitlog/dwilyast02,64391,1332830608263_file%3A%2Fopt%2Fhbase%2Fdata%2F.logs%2Fdwilyast02%2C55897%2C1332401896263-splitting%2Fdwilyast02%252C55897%252C1332401896263.1332650381423/tsdb/c332a6033e280b786219866513f45fe1/recovered.edits/0000000000000181211.temp region=c332a6033e280b786219866513f45fe1
2012-03-27 08:51:07,640 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[630, 630]=
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/opt/hbase/data/.logs/dwilyast02,55897,1332401896263-splitting/dwilyast02%2C55897%2C1332401896263.1332650381423 at 3668992
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1988)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1888)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1934)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:206)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:180)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getNextLogLine(HLogSplitter.java:789)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:407)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
        at java.lang.Thread.run(Thread.java:662)

2012-03-27 08:59:10,796 WARN org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting of [dwilyast02,55897,1332401896263]
java.io.IOException: error or interrupt while splitting logs in [file:/opt/hbase/data/.logs/dwilyast02,55897,1332401896263-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:268)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:276)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:216)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:487)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:326)
        at org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster.run(HMasterCommandLine.java:218)
        at java.lang.Thread.run(Thread.java:662)





tsuna

Mar 29, 2012, 9:07:03 PM
to Toni Moreno, open...@googlegroups.com
On Tue, Mar 27, 2012 at 12:09 AM, Toni Moreno <toni....@gmail.com> wrote:
> 2012-03-27 08:51:07,640 INFO org.apache.hadoop.fs.FSInputChecker: Found
> checksum error: b[630, 630]=
> org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/opt/hbase/data/.logs/dwilyast02,55897,1332401896263-splitting/dwilyast02%2C55897%2C1332401896263.1332650381423
> at 3668992

Looks like you have corrupted data. You should send an email to the
HBase users mailing list to ask for help. This probably requires a
little surgery to get fixed.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Stack

Mar 29, 2012, 11:41:17 PM
to Toni Moreno, open...@googlegroups.com
On Tue, Mar 27, 2012 at 12:09 AM, Toni Moreno <toni....@gmail.com> wrote:
> 2012-03-27 08:51:07,544 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
> path=file:/opt/hbase/data/splitlog/dwilyast02,64391,1332830608263_file%3A%2Fopt%2Fhbase%2Fdata%2F.logs%2Fdwilyast02%2C55897%2C1332401896263-splitting%2Fdwilyast02%252C55897%252C1332401896263.1332650381423/tsdb/c332a6033e280b786219866513f45fe1/recovered.edits/0000000000000181211.temp
> region=c332a6033e280b786219866513f45fe1
> 2012-03-27 08:51:07,640 INFO org.apache.hadoop.fs.FSInputChecker: Found
> checksum error: b[630, 630]=
> org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/opt/hbase/data/.logs/dwilyast02,55897,1332401896263-splitting/dwilyast02%2C55897%2C1332401896263.1332650381423
>

Are you running on the local filesystem, or do you have only a single
replica of your data? If so, a bit flipped and you get a checksum error
rereading the data. If there were replicas, e.g. if you were running on
a distributed HDFS, the rotted block would be discarded and a replica
used in its place.
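
For reference, running on a distributed HDFS basically means pointing hbase.rootdir at the HDFS cluster instead of the local filesystem. A minimal hbase-site.xml sketch, where the NameNode host/port and the replication factor are illustrative rather than taken from this thread:

<!-- Store HBase data on HDFS so every block has multiple replicas. -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<!-- Replication is normally set on the HDFS side (hdfs-site.xml); with the
     usual 3 replicas, a flipped bit on one DataNode can be repaired from
     another copy instead of failing the read. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>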
St.Ack

bwann

Jul 10, 2012, 3:23:15 AM
to open...@googlegroups.com
(I realize this thread is a few months old, but this may help somebody)

I ran into this exact same scenario a few weeks ago. I followed the quickie how-to to set up a single-node instance, and TSD worked well. Then I rebooted, and all of a sudden TSD was unhappy because -ROOT- was gone, because HBase was unhappy due to the exact same ChecksumException. I forget how gracefully I shut the system down, whether I stopped HBase first or did something silly and had to reset. Also, hbck was of no help since the region wasn't online ("root region is null [...] fatal").

Anyway, from what I can tell, in 0.92 (HBASE-1364) they introduced a distributed log splitting feature. Googling around for the Java exception led me to putting this in my HBase config file:

<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>false</value>
</property>

After this, the region came online and TSD was happy. After spending a few minutes just now reading over the source and log files, it's not obvious to me what this setting did; I'm completely new to HBase, so I don't know how things work under the hood.

Without really diving in to say for sure, my guess is that the reboot left a truncated log file, and when HBase started up and tried to rotate the file (which now failed a checksum), it couldn't; possibly disabling distributed splitting short-circuited something and let it move the file out of the way. Having said that, distributed log splitting is apparently an important new performance feature and should really be left enabled according to the HBase docs. Or it all could be a red herring!
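
For context, a sketch of where this might go in hbase-site.xml on a single-node setup like this one (the rootdir value is inferred from the file:/opt/hbase/data paths in the logs above, so treat it as illustrative):

<configuration>
  <!-- Standalone rootdir on the local filesystem, as suggested by the
       log paths earlier in this thread. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/hbase/data</value>
  </property>
  <!-- Workaround from this thread: fall back to the old master-side log
       splitting instead of the distributed splitting added in 0.92. -->
  <property>
    <name>hbase.master.distributed.log.splitting</name>
    <value>false</value>
  </property>
</configuration>

Remember to restart HBase (and then TSD) after changing this.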


On Tuesday, March 27, 2012 12:09:29 AM UTC-7, Toni Moreno wrote:
> I have a working HBase 0.92.0 and OpenTSDB 1.1.0 installation, and I've been collecting about 500 metrics per minute for the past week. Suddenly, in the middle of the week, my HBase and TSD processes seemed frozen, so I restarted them all by killing the processes.
>
> 2012-03-27 08:51:34,815 ERROR [main-EventThread] HBaseClient: The znode for the -ROOT- region doesn't exist!

joanka

Aug 16, 2013, 5:24:36 AM
to open...@googlegroups.com
Thank you for sharing this solution. This is exactly what I was looking for, for the last 24 hours, after I killed the HMaster and couldn't get HBase working again. It works perfectly now ;)