Interesting problem I'm running into. It appears that our master has died after a failed region assignment. Any ideas as to why?
We are running version 0.90.1 in this cluster. H3M1 = master H3S* = regionservers
out on H3S3 the regionserver process is running (but didn't respond per the log)
2011-06-27 10:18:22,517 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to disable table ContentGuidance 2011-06-27 10:18:22,522 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 regions. 2011-06-27 10:18:23,527 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table is done=true 2011-06-27 10:18:24,536 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table operation C_M_DELETE_TABLE on table ContentGuidance 2011-06-27 10:18:24,584 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region ContentGuidance,,1298676132327.9d63b6173c7307c27aa9728f738f0ef3. from META 2011-06-27 10:19:17,296 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: Moving table ContentGuidance state to enabled but was already enabled 2011-06-27 10:19:17,313 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: HLog configuration: blocksize=64 MB, rollsize=60.8 MB, enabled=true, flushlogentries=1, optionallogflushinternal=1000ms 2011-06-27 10:19:17,349 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: New hlog /hbase/ContentGuidance/f38bbc9bc672b5ea415b6dcc7e1da43a/.logs/hlog.13091915 57313 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Using getNumCurrentReplicas--HDFS-826 2011-06-27 10:19:17,433 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a.; next sequenceid=1 2011-06-27 10:19:17,438 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added region ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. to META 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 87 on 60000.logSyncer exiting 2011-06-27 10:19:17,470 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1 region(s) round-robin across 4 server(s) 2011-06-27 10:19:17,472 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done 2011-06-27 10:19:17,480 INFO org.apache.hadoop.hbase.master.AssignmentManager: H3S4,60020,1308946657703 unassigned znodes=1 of total=1 2011-06-27 10:19:21,530 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to disable table ContentGuidance 2011-06-27 10:19:21,536 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 regions. 2011-06-27 10:19:22,543 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table is done=true 2011-06-27 10:19:23,544 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table operation C_M_DELETE_TABLE on table ContentGuidance 2011-06-27 10:19:23,551 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. from META 2011-06-27 10:20:17,479 FATAL org.apache.hadoop.hbase.master.HMaster: Failed assignment of regions to serverName=H3S3,60020,1308946657608, load=(requests=0, regions=7, usedHeap=976, maxHeap=1011) java.net.SocketTimeoutException: Call to H3S3/10..x.x.x3:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 remote=H3S3/10.x.x.x:60020] at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy7.openRegions(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.j ava:566) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.j ava:776) at org.apache.hadoop.hbase.master.AssignmentManager$SingleServerBulkAssigner.r un(AssignmentManager.java:1310) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.j ava:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 908) at java.lang.Thread.run(Thread.java:619) Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 remote=H3S3/10..x.x.x:60020] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164 ) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBa seClient.java:299) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseCli ent.java:539) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477 ) 2011-06-27 10:20:17,482 INFO org.apache.hadoop.hbase.master.HMaster: Aborting 2011-06-27 10:20:18,390 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.CatalogJanitor: H3M1:60000-CatalogJanitor exiting 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.HMaster$1: H3M1:60000-BalancerChore exiting 2011-06-27 10:20:18,392 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60000
Yeah the BulkAssigner uses a AssignmentManager.assign method that won't retry if it gets an exception because originally it was only used during startup (eg if you can't assign the regions properly at startup then kill yourself).
In this case it happened to you when creating a table (it also uses the BulkAssigner now). To me it's a bug, it should just be assigned somewhere else. Mind filing a jira?
As to how you got a 60 secs timeout, you should take a look at the RS log (and next time it happens also do a jstack).
On Mon, Jun 27, 2011 at 9:54 AM, Matt Davies <matt.dav...@tynt.com> wrote: > Hey all,
> Interesting problem I'm running into. It appears that our master has died > after a failed region assignment. Any ideas as to why?
> We are running version 0.90.1 in this cluster. > H3M1 = master > H3S* = regionservers
> out on H3S3 the regionserver process is running (but didn't respond per the > log)
> 2011-06-27 10:18:22,517 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to > disable table ContentGuidance > 2011-06-27 10:18:22,522 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 > regions. > 2011-06-27 10:18:23,527 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table > is done=true > 2011-06-27 10:18:24,536 INFO > org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table > operation C_M_DELETE_TABLE on table ContentGuidance > 2011-06-27 10:18:24,584 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Deleted region > ContentGuidance,,1298676132327.9d63b6173c7307c27aa9728f738f0ef3. from META > 2011-06-27 10:19:17,296 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: > Moving table ContentGuidance state to enabled but was already enabled > 2011-06-27 10:19:17,313 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > HLog configuration: blocksize=64 MB, rollsize=60.8 MB, enabled=true, > flushlogentries=1, optionallogflushinternal=1000ms > 2011-06-27 10:19:17,349 INFO > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs > -- HDFS-200 > 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > New hlog > /hbase/ContentGuidance/f38bbc9bc672b5ea415b6dcc7e1da43a/.logs/hlog.13091915 57313 > 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > Using getNumCurrentReplicas--HDFS-826 > 2011-06-27 10:19:17,433 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a.; > next sequenceid=1 > 2011-06-27 10:19:17,438 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Added region > ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. to META > 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Closed ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. > 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > IPC Server handler 87 on 60000.logSyncer exiting > 2011-06-27 10:19:17,470 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1 region(s) > round-robin across 4 server(s) > 2011-06-27 10:19:17,472 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done > 2011-06-27 10:19:17,480 INFO > org.apache.hadoop.hbase.master.AssignmentManager: H3S4,60020,1308946657703 > unassigned znodes=1 of total=1 > 2011-06-27 10:19:21,530 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to > disable table ContentGuidance > 2011-06-27 10:19:21,536 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 > regions. > 2011-06-27 10:19:22,543 INFO > org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table > is done=true > 2011-06-27 10:19:23,544 INFO > org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table > operation C_M_DELETE_TABLE on table ContentGuidance > 2011-06-27 10:19:23,551 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Deleted region > ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. from META > 2011-06-27 10:20:17,479 FATAL org.apache.hadoop.hbase.master.HMaster: Failed > assignment of regions to serverName=H3S3,60020,1308946657608, > load=(requests=0, regions=7, usedHeap=976, maxHeap=1011) > java.net.SocketTimeoutException: Call to H3S3/10..x.x.x3:60020 failed on > socket timeout exception: java.net.SocketTimeoutException: 60000 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 > remote=H3S3/10.x.x.x:60020] > at > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802) > at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) > at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > at $Proxy7.openRegions(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.j ava:566) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.j ava:776) > at > org.apache.hadoop.hbase.master.AssignmentManager$SingleServerBulkAssigner.r un(AssignmentManager.java:1310) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.j ava:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 908) > at java.lang.Thread.run(Thread.java:619) > Caused by: java.net.SocketTimeoutException: 60000 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 > remote=H3S3/10..x.x.x:60020] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164 ) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.FilterInputStream.read(FilterInputStream.java:116) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBa seClient.java:299) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > at java.io.DataInputStream.readInt(DataInputStream.java:370) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseCli ent.java:539) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477 ) > 2011-06-27 10:20:17,482 INFO org.apache.hadoop.hbase.master.HMaster: > Aborting > 2011-06-27 10:20:18,390 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > server on 60000 > 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.CatalogJanitor: > H3M1:60000-CatalogJanitor exiting > 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.HMaster$1: > H3M1:60000-BalancerChore exiting > 2011-06-27 10:20:18,392 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC > Server listener on 60000
<jdcry...@apache.org> wrote: > Yeah the BulkAssigner uses a AssignmentManager.assign method that > won't retry if it gets an exception because originally it was only > used during startup (eg if you can't assign the regions properly at > startup then kill yourself).
> In this case it happened to you when creating a table (it also uses > the BulkAssigner now). To me it's a bug, it should just be assigned > somewhere else. Mind filing a jira?
> As to how you got a 60 secs timeout, you should take a look at the RS > log (and next time it happens also do a jstack).
> Thx,
> J-D
> On Mon, Jun 27, 2011 at 9:54 AM, Matt Davies <matt.dav...@tynt.com> wrote: >> Hey all,
>> Interesting problem I'm running into. It appears that our master has died >> after a failed region assignment. Any ideas as to why?
>> We are running version 0.90.1 in this cluster. >> H3M1 = master >> H3S* = regionservers
>> out on H3S3 the regionserver process is running (but didn't respond per the >> log)
>> 2011-06-27 10:18:22,517 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to >> disable table ContentGuidance >> 2011-06-27 10:18:22,522 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 >> regions. >> 2011-06-27 10:18:23,527 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table >> is done=true >> 2011-06-27 10:18:24,536 INFO >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table >> operation C_M_DELETE_TABLE on table ContentGuidance >> 2011-06-27 10:18:24,584 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Deleted region >> ContentGuidance,,1298676132327.9d63b6173c7307c27aa9728f738f0ef3. from META >> 2011-06-27 10:19:17,296 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: >> Moving table ContentGuidance state to enabled but was already enabled >> 2011-06-27 10:19:17,313 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: >> HLog configuration: blocksize=64 MB, rollsize=60.8 MB, enabled=true, >> flushlogentries=1, optionallogflushinternal=1000ms >> 2011-06-27 10:19:17,349 INFO >> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs >> -- HDFS-200 >> 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: >> New hlog >> /hbase/ContentGuidance/f38bbc9bc672b5ea415b6dcc7e1da43a/.logs/hlog.13091915 57313 >> 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: >> Using getNumCurrentReplicas--HDFS-826 >> 2011-06-27 10:19:17,433 INFO org.apache.hadoop.hbase.regionserver.HRegion: >> Onlined ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a.; >> next sequenceid=1 >> 2011-06-27 10:19:17,438 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Added region >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. to META >> 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.HRegion: >> Closed ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. >> 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: >> IPC Server handler 87 on 60000.logSyncer exiting >> 2011-06-27 10:19:17,470 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1 region(s) >> round-robin across 4 server(s) >> 2011-06-27 10:19:17,472 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done >> 2011-06-27 10:19:17,480 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: H3S4,60020,1308946657703 >> unassigned znodes=1 of total=1 >> 2011-06-27 10:19:21,530 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to >> disable table ContentGuidance >> 2011-06-27 10:19:21,536 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 >> regions. >> 2011-06-27 10:19:22,543 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table >> is done=true >> 2011-06-27 10:19:23,544 INFO >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table >> operation C_M_DELETE_TABLE on table ContentGuidance >> 2011-06-27 10:19:23,551 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Deleted region >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. from META >> 2011-06-27 10:20:17,479 FATAL org.apache.hadoop.hbase.master.HMaster: Failed >> assignment of regions to serverName=H3S3,60020,1308946657608, >> load=(requests=0, regions=7, usedHeap=976, maxHeap=1011) >> java.net.SocketTimeoutException: Call to H3S3/10..x.x.x3:60020 failed on >> socket timeout exception: java.net.SocketTimeoutException: 60000 millis >> timeout while waiting for channel to be ready for read. ch : >> java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 >> remote=H3S3/10.x.x.x:60020] >> at >> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802) >> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) >> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) >> at $Proxy7.openRegions(Unknown Source) >> at >> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.j ava:566) >> at >> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.j ava:776) >> at >> org.apache.hadoop.hbase.master.AssignmentManager$SingleServerBulkAssigner.r un(AssignmentManager.java:1310) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.j ava:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 908) >> at java.lang.Thread.run(Thread.java:619) >> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while >> waiting for channel to be ready for read. ch : >> java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 >> remote=H3S3/10..x.x.x:60020] >> at >> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164 ) >> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) >> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) >> at java.io.FilterInputStream.read(FilterInputStream.java:116) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBa seClient.java:299) >> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) >> at java.io.BufferedInputStream.read(BufferedInputStream.java:237) >> at java.io.DataInputStream.readInt(DataInputStream.java:370) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseCli ent.java:539) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477 ) >> 2011-06-27 10:20:17,482 INFO org.apache.hadoop.hbase.master.HMaster: >> Aborting >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.ipc.HBaseServer: Stopping >> server on 60000 >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.CatalogJanitor: >> H3M1:60000-CatalogJanitor exiting >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.HMaster$1: >> H3M1:60000-BalancerChore exiting >> 2011-06-27 10:20:18,392 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC >> Server listener on 60000
> Looks like I'll be fixing it in the context of HBASE-3984.
> J-D
> On Mon, Jun 27, 2011 at 10:21 AM, Jean-Daniel Cryans > <jdcry...@apache.org> wrote: > > Yeah the BulkAssigner uses a AssignmentManager.assign method that > > won't retry if it gets an exception because originally it was only > > used during startup (eg if you can't assign the regions properly at > > startup then kill yourself).
> > In this case it happened to you when creating a table (it also uses > > the BulkAssigner now). To me it's a bug, it should just be assigned > > somewhere else. Mind filing a jira?
> > As to how you got a 60 secs timeout, you should take a look at the RS > > log (and next time it happens also do a jstack).
> > Thx,
> > J-D
> > On Mon, Jun 27, 2011 at 9:54 AM, Matt Davies <matt.dav...@tynt.com> > wrote: > >> Hey all,
> >> Interesting problem I'm running into. It appears that our master has > died > >> after a failed region assignment. Any ideas as to why?
> >> We are running version 0.90.1 in this cluster. > >> H3M1 = master > >> H3S* = regionservers
> >> out on H3S3 the regionserver process is running (but didn't respond per > the > >> log)
> >> 2011-06-27 10:18:22,517 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to > >> disable table ContentGuidance > >> 2011-06-27 10:18:22,522 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 > >> regions. > >> 2011-06-27 10:18:23,527 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled > table > >> is done=true > >> 2011-06-27 10:18:24,536 INFO > >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table > >> operation C_M_DELETE_TABLE on table ContentGuidance > >> 2011-06-27 10:18:24,584 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > >> Deleted region > >> ContentGuidance,,1298676132327.9d63b6173c7307c27aa9728f738f0ef3. from > META > >> 2011-06-27 10:19:17,296 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: > >> Moving table ContentGuidance state to enabled but was already enabled > >> 2011-06-27 10:19:17,313 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > >> HLog configuration: blocksize=64 MB, rollsize=60.8 MB, enabled=true, > >> flushlogentries=1, optionallogflushinternal=1000ms > >> 2011-06-27 10:19:17,349 INFO > >> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using > syncFs > >> -- HDFS-200 > >> 2011-06-27 10:19:17,350 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > >> New hlog
> /hbase/ContentGuidance/f38bbc9bc672b5ea415b6dcc7e1da43a/.logs/hlog.13091915 57313 > >> 2011-06-27 10:19:17,350 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > >> Using getNumCurrentReplicas--HDFS-826 > >> 2011-06-27 10:19:17,433 INFO > org.apache.hadoop.hbase.regionserver.HRegion: > >> Onlined > ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a.; > >> next sequenceid=1 > >> 2011-06-27 10:19:17,438 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > >> Added region > >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. to META > >> 2011-06-27 10:19:17,439 INFO > org.apache.hadoop.hbase.regionserver.HRegion: > >> Closed ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. > >> 2011-06-27 10:19:17,439 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > >> IPC Server handler 87 on 60000.logSyncer exiting > >> 2011-06-27 10:19:17,470 INFO > >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1 > region(s) > >> round-robin across 4 server(s) > >> 2011-06-27 10:19:17,472 INFO > >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done > >> 2011-06-27 10:19:17,480 INFO > >> org.apache.hadoop.hbase.master.AssignmentManager: > H3S4,60020,1308946657703 > >> unassigned znodes=1 of total=1 > >> 2011-06-27 10:19:21,530 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to > >> disable table ContentGuidance > >> 2011-06-27 10:19:21,536 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 > >> regions. > >> 2011-06-27 10:19:22,543 INFO > >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled > table > >> is done=true > >> 2011-06-27 10:19:23,544 INFO > >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table > >> operation C_M_DELETE_TABLE on table ContentGuidance > >> 2011-06-27 10:19:23,551 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > >> Deleted region > >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. from > META > >> 2011-06-27 10:20:17,479 FATAL org.apache.hadoop.hbase.master.HMaster: > Failed > >> assignment of regions to serverName=H3S3,60020,1308946657608, > >> load=(requests=0, regions=7, usedHeap=976, maxHeap=1011) > >> java.net.SocketTimeoutException: Call to H3S3/10..x.x.x3:60020 failed on > >> socket timeout exception: java.net.SocketTimeoutException: 60000 millis > >> timeout while waiting for channel to be ready for read. ch : > >> java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 > >> remote=H3S3/10.x.x.x:60020] > >> at
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802) > >> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) > >> at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > >> at $Proxy7.openRegions(Unknown Source) > >> at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.j ava:566) > >> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.j ava:776) > >> at
> org.apache.hadoop.hbase.master.AssignmentManager$SingleServerBulkAssigner.r un(AssignmentManager.java:1310) > >> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.j ava:886) > >> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 908) > >> at java.lang.Thread.run(Thread.java:619) > >> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while > >> waiting for channel to be ready for read. ch : > >> java.nio.channels.SocketChannel[connected local=/10..x.x.x:34904 > >> remote=H3S3/10..x.x.x:60020] > >> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164 ) > >> at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > >> at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > >> at java.io.FilterInputStream.read(FilterInputStream.java:116) > >> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBa seClient.java:299) > >> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > >> at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > >> at java.io.DataInputStream.readInt(DataInputStream.java:370) > >> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseCli ent.java:539) > >> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477 ) > >> 2011-06-27 10:20:17,482 INFO org.apache.hadoop.hbase.master.HMaster: > >> Aborting > >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > >> server on 60000 > >> 2011-06-27 10:20:18,390 INFO > org.apache.hadoop.hbase.master.CatalogJanitor: > >> H3M1:60000-CatalogJanitor exiting > >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.HMaster$1: > >> H3M1:60000-BalancerChore exiting > >> 2011-06-27 10:20:18,392 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > IPC > >> Server listener on 60000