Hi,
We are using Lily 2.0. Our live site went down for the second time in the last two weeks. The lily-client and Lily-server logs are below.
I believe the primary cause is the ZooKeeper disconnect. In a separate thread, Bruno mentioned that Lily should be able to survive a ZooKeeper disconnect. Do we need to move to the latest Lily version?
For ZooKeeper, I have also increased syncLimit to 10 with a tickTime of 2000, but that doesn't seem to help.
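As a rough sketch of what these two settings control (Python, purely for illustration): syncLimit is counted in ticks and bounds how far a follower may lag behind the leader, whereas a client's session timeout is clamped by the server to between 2 and 20 ticks unless minSessionTimeout/maxSessionTimeout are set explicitly. The function names below are made up for this example, and it assumes the default session-timeout bounds are in effect.

```python
# Values quoted in the post.
TICK_TIME_MS = 2000
SYNC_LIMIT_TICKS = 10


def follower_sync_window_ms(tick_time_ms, sync_limit_ticks):
    """How long a follower may be out of sync with the leader (syncLimit ticks)."""
    return tick_time_ms * sync_limit_ticks


def session_timeout_bounds_ms(tick_time_ms):
    """Default bounds for a client session timeout.

    Assumes minSessionTimeout and maxSessionTimeout are left at their
    defaults of 2 * tickTime and 20 * tickTime.
    """
    return 2 * tick_time_ms, 20 * tick_time_ms


def negotiated_session_timeout_ms(requested_ms, tick_time_ms):
    """Clamp the timeout a client asks for into the server's allowed range."""
    low, high = session_timeout_bounds_ms(tick_time_ms)
    return max(low, min(requested_ms, high))


print(follower_sync_window_ms(TICK_TIME_MS, SYNC_LIMIT_TICKS))
print(session_timeout_bounds_ms(TICK_TIME_MS))
```

Under these assumptions, syncLimit affects only server-to-server synchronisation inside the ensemble. The session timeout that decides when a client such as Lily is considered gone is governed by tickTime and by the timeout the client requests.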
Thanks,
Prashant
The trouble starts like this:
Lily-client logs:
2013-11-03 16:19:19,267 ZooKeeper disconnected at Sun Nov 03 16:45:11 UTC 2013
ZooKeeper connected at Sun Nov 03 16:45:14 UTC 2013
ZooKeeper connected at Sun Nov 03 16:45:22 UTC 2013
470745027 [TypeManager cache refresher] ERROR org.lilyproject.client.RemoteSchemaCache - Error refreshing type manager cache. Cache is possibly out of date!
2013-11-03 16:49:14,597 java.lang.reflect.UndeclaredThrowableException
at $Proxy9.getTypesWithoutCache(Unknown Source)
at org.lilyproject.repository.impl.AbstractSchemaCache.refreshAll(AbstractSchemaCache.java:338)
at org.lilyproject.repository.impl.AbstractSchemaCache.access$800(AbstractSchemaCache.java:63)
at org.lilyproject.repository.impl.AbstractSchemaCache$CacheRefresher.run(AbstractSchemaCache.java:596)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
Sun Nov 03 16:45:42 UTC 2013, org.apache.hadoop.hbase.client.ScannerCallable@1e30a83e, java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
Lily-server logs:
[WARN ] <2013-11-03 16:45:16,678> (org.lilyproject.util.zookeeper.StateWatchingZooKeeper): Disconnected from ZooKeeper
[INFO ] <2013-11-03 16:45:16,685> (org.lilyproject.util.zookeeper.LeaderElection): No longer leader for the position of RowLog Processor mq
[INFO ] <2013-11-03 16:45:16,685> (org.lilyproject.util.zookeeper.LeaderElection): No longer leader for the position of RowLog Processor wal
[INFO ] <2013-11-03 16:45:16,685> (org.lilyproject.util.zookeeper.LeaderElection): No longer leader for the position of Blob Incubator Monitor
[INFO ] <2013-11-03 16:45:16,685> (org.lilyproject.util.zookeeper.LeaderElection): No longer leader for the position of Indexer Master
[INFO ] <2013-11-03 16:45:16,687> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Shutting down row log processor for wal
[INFO ] <2013-11-03 16:45:16,689> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Shutting down row log processor for mq
[INFO ] <2013-11-03 16:45:16,689> (org.lilyproject.indexer.master.IndexerMaster): Shutting down as indexer master.
[INFO ] <2013-11-03 16:45:16,736> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Shutdown of row log processor sucessful for wal
[WARN ] <2013-11-03 16:45:16,736> (org.lilyproject.util.zookeeper.StateWatchingZooKeeper): Connected to ZooKeeper
[INFO ] <2013-11-03 16:45:16,751> (org.lilyproject.util.zookeeper.LeaderElection): Elected as leader for the position of RowLog Processor mq
[INFO ] <2013-11-03 16:45:16,765> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Shutdown of row log processor sucessful for mq
[INFO ] <2013-11-03 16:45:16,765> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Starting row log processor for mq
[INFO ] <2013-11-03 16:45:16,775> (org.lilyproject.util.zookeeper.LeaderElection): Elected as leader for the position of RowLog Processor wal
[INFO ] <2013-11-03 16:45:16,775> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Starting row log processor for wal
[INFO ] <2013-11-03 16:45:16,821> (org.lilyproject.util.zookeeper.LeaderElection): Elected as leader for the position of Blob Incubator Monitor
[INFO ] <2013-11-03 16:45:16,823> (org.lilyproject.util.zookeeper.LeaderElection): Elected as leader for the position of Indexer Master
[INFO ] <2013-11-03 16:45:18,773> (org.lilyproject.indexer.master.IndexerMaster): Shutdown as indexer master successful.
[INFO ] <2013-11-03 16:45:18,773> (org.lilyproject.indexer.master.IndexerMaster): Starting up as indexer master.
[INFO ] <2013-11-03 16:45:18,773> (org.lilyproject.indexer.master.IndexerMaster): Startup as indexer master successful.
[INFO ] <2013-11-03 16:45:27,665> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): Maximum global queue scan threads set to 1
[INFO ] <2013-11-03 16:45:27,666> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): Maximum global queue scan threads set to 1
[INFO ] <2013-11-03 16:45:27,667> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Startup of row log processor successful for mq
[INFO ] <2013-11-03 16:45:27,668> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): RowLog scan batch size (on each shard/split): 1000
[INFO ] <2013-11-03 16:45:27,668> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): RowLog messages work queue size: 1000
[INFO ] <2013-11-03 16:45:27,697> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): RowLog scan batch size (on each shard/split): 1000
[INFO ] <2013-11-03 16:45:27,697> (org.lilyproject.rowlog.impl.RowLogProcessorImpl): RowLog messages work queue size: 1000
[INFO ] <2013-11-03 16:45:27,698> (org.lilyproject.rowlog.impl.RowLogProcessorElection): Startup of row log processor successful for wal
[WARN ] <2013-11-03 16:50:39,121> (org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation): Encountered problems when prefetch META table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
Sun Nov 03 16:47:24 UTC 2013, org.apache.hadoop.hbase.client.HTable$4@79871bba, java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave1.truegether.com/10.185.6.143:60020]
Sun Nov 03 16:47:45 UTC 2013, org.apache.hadoop.hbase.client.HTable$4@79871bba, java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave1.truegether.com/10.185.6.143:60020]
Sun Nov 03 16:48:07 UTC 2013, org.apache.hadoop.hbase.client.HTable$4@79871bba, java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave1.truegether.com/10.185.6.143:60020]
Sun Nov 03 16:48:14 UTC 2013, org.apache.hadoop.hbase.client.HTable$4@79871bba, java.net.NoRouteToHostException: No route to host
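
To put a number on how long the server-side disconnect lasted, the Lily-server lines above can be parsed for the "Disconnected from ZooKeeper" / "Connected to ZooKeeper" pair. This is only a sketch: the regular expression is an assumption based on the line layout in this excerpt, and the helper names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# [WARN ] <2013-11-03 16:45:16,678> (some.Logger): message text
LINE_RE = re.compile(r"\[(\w+)\s*\] <([\d-]+ [\d:,]+)> \(([\w.$]+)\): (.*)")


def parse_line(line):
    """Return (timestamp, level, logger, message), or None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    level, stamp, logger, message = match.groups()
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return when, level, logger, message


def disconnect_windows(lines):
    """Pair each disconnect with the next reconnect and measure the gap."""
    windows = []
    lost_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # stack traces and wrapped lines are skipped
        when, _level, _logger, message = parsed
        if message == "Disconnected from ZooKeeper":
            lost_at = when
        elif message == "Connected to ZooKeeper" and lost_at is not None:
            windows.append((lost_at, when, when - lost_at))
            lost_at = None
    return windows


# The two StateWatchingZooKeeper lines from the server log above.
SAMPLE = [
    "[WARN ] <2013-11-03 16:45:16,678> (org.lilyproject.util.zookeeper.StateWatchingZooKeeper): Disconnected from ZooKeeper",
    "[WARN ] <2013-11-03 16:45:16,736> (org.lilyproject.util.zookeeper.StateWatchingZooKeeper): Connected to ZooKeeper",
]
WINDOWS = disconnect_windows(SAMPLE)
for start, end, gap in WINDOWS:
    print(start, end, gap)
```

On the two lines in this excerpt the gap is 58 ms, even though all four leader positions were dropped and re-acquired in that window.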