Configurations stored in ZooKeeper get wiped out

Nathan McGinnis

Jun 21, 2016, 5:14:37 PM
to exhibitor-users
Hello Everyone,

I have a very strange issue and I'm wondering whether anyone in the community has an idea what could be causing it.  The configs we store in ZooKeeper (collection info, schema, solrconfigs, etc.) keep getting wiped out at seemingly random times.  We see no correlation with time of day or system load, and it happens on both production and non-live clusters.  We've had clusters up and taking production traffic for over a year before the configs suddenly disappeared.  The same issue happened again this morning on a cluster that had only been running for 3 weeks.

This is a Solr 5.3.1 cluster that points to a 3-node ZooKeeper 3.4.8 ensemble.  All ZK nodes also run Exhibitor 1.5.5, which uses the shared s3config to set up the ensemble.  All systems run on CentOS 7 in AWS.  We run a very similar setup on prem without Exhibitor and have never had this issue in the years it's been running.

We were alerted this morning that a single ZK node was not connectable (not responding to ruok).  Checking the Exhibitor UI during the incident, I immediately saw that this server had been removed by auto instance management and then re-added with a different ID once it became available again.  I also noticed in the "Explore" tab that all of the configs were gone, with the exception of some files in the /overseer and /overseer_elect directories.
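
For anyone who wants to reproduce the ruok check by hand, here is a minimal sketch of it in Python. It assumes the default client port 2181 and that the server answers four-letter commands; the function names `zk_four_letter` and `is_healthy` are made up for this example and are not part of ZooKeeper or Exhibitor.

```python
import socket


def zk_four_letter(host, port, command=b"ruok", timeout=3.0):
    """Send a four-letter command to a ZooKeeper server and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        # The server closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_healthy(host, port=2181, timeout=3.0):
    """Return True only if the server answers 'imok' to 'ruok'."""
    try:
        return zk_four_letter(host, port, b"ruok", timeout) == "imok"
    except OSError:
        # Connection refused, reset, or timed out: treat as not healthy.
        return False
```

Calling `is_healthy("zk-host-1")` (hypothetical hostname) against each ensemble member gives the same signal as the alert described above. Note that "imok" only means the process is running; it does not say whether the node has joined a quorum.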

This is the very first sign of trouble across all 3 Zookeeper.out logs (taken from the server that went down):

-----
2016-06-21 10:00:35,869 [myid:10] - WARN [RecvWorker:9:QuorumCnxManager$RecvWorker@810] - Connection broken for id 9, my id = 10, error = 
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)

2016-06-21 10:00:40,655 [myid:10] - WARN [QuorumPeer[myid=10]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)

2016-06-21 10:00:40,656 [myid:10] - WARN [SendWorker:11:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2016-06-21 10:00:40,656 [myid:10] - INFO [QuorumPeer[myid=10]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
-----

I then see a lot of FastLeaderElection messages with state "LOOKING", followed by some Solr connections being refused because "ZooKeeperServer is not running".  This single node also logged a massive number of "New Election" messages (100k in 10 mins).

-----
2016-06-21 10:02:50,680 [myid:] - ERROR [main:QuorumPeerMain@85] - Invalid config, exiting abnormally
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /opt/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:123)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:101)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.lang.IllegalArgumentException: /var/lib/zookeeper/myid file is missing
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:341)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:119)
... 2 more
Invalid config, exiting abnormally

2016-06-21 10:03:59,754 [myid:12] - WARN  [QuorumPeer[myid=12]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.IOException: No log files found to truncate! This could happen if you still have snapshots from an old setup or log files were deleted accidentally or dataLogDir was changed in zoo.cfg.
at org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:368)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
at org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:343)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:82)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
-----

We also found errors about a missing myid file, which appear to be caused by Exhibitor's auto instance management.  You can see that when this server finally came back up it took on a new id, 12 (it was 10).  I have no idea whether the configs stored in ZooKeeper would still be there if Exhibitor were not in the picture.  Does anyone have any insight into this very bizarre issue?
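
The myid failure in the log above can be checked for ahead of a restart. Below is a rough sketch, assuming a plain `key=value` zoo.cfg with `dataDir` and `server.N` entries; `parse_zoo_cfg` and `check_myid` are hypothetical helper names for illustration, not anything ZooKeeper or Exhibitor ships.

```python
import re
from pathlib import Path


def parse_zoo_cfg(text):
    """Extract dataDir and the set of server ids from zoo.cfg contents."""
    data_dir = None
    server_ids = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "dataDir":
            data_dir = value
        else:
            match = re.fullmatch(r"server\.(\d+)", key)
            if match:
                server_ids.add(int(match.group(1)))
    return data_dir, server_ids


def check_myid(cfg_path):
    """Return None if myid is consistent with zoo.cfg, else a problem description."""
    data_dir, server_ids = parse_zoo_cfg(Path(cfg_path).read_text())
    if data_dir is None:
        return "dataDir is not set in zoo.cfg"
    myid_file = Path(data_dir) / "myid"
    if not myid_file.is_file():
        return f"{myid_file} file is missing"
    content = myid_file.read_text().strip()
    if not content.isdigit():
        return f"{myid_file} does not contain a numeric id"
    if int(content) not in server_ids:
        return f"id {content} is not listed as server.{content} in zoo.cfg"
    return None
```

On the hosts in this thread that would be something like `check_myid("/opt/zookeeper/zookeeper-3.4.8/conf/zoo.cfg")`. It only detects the inconsistency; it does not explain why the file went missing.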

Thank you,
Nathan
 

Anup Shrivastava

Jun 21, 2016, 8:44:02 PM
to exhibit...@googlegroups.com

Hi Nathan,

Please check the content of the ZooKeeper config file at the location that has been set for the s3config property.

It seems there is a connectivity issue between your hosted ZooKeeper and the s3config bucket location. There should be one file, exebdtor....property, that holds the complete config information.

PS: If you are using ZooKeeper in production, you should have zoodata backup and restore enabled for emergencies.

Regards!
Anup

Nathan McGinnis

Jun 27, 2016, 1:31:04 PM
to exhibitor-users
I reached out to AWS support and they confirmed there were no issues with our instances reaching the S3 bucket that holds the shared config.  The issue just happened again on another cluster when I spun up a new ZooKeeper instance (I am swapping the nodes out for instances with an upgraded Java version).  Basically, once the new instance came up (4 total now), it appears to have wiped the configurations stored in ZooKeeper.

What I believe is happening (or at least what makes some sense to me) is that when the new instance joins the ensemble, the other instances are bounced by auto instance management before all the data has been replicated to it.  The new host, which has no configs, then becomes leader, and the other hosts come back up as followers and, presumably, sync to its empty state.  It is recoverable, but it is very time consuming to create a new collection and manually copy the index data.

It seems our interim resolution will be to disable auto instance management once the cluster is formed.