Recovering zookeeper state


John Fanjoy

Jan 25, 2016, 10:00:02 PM
to marathon-framework
Hello,

I have been troubleshooting some event API issues that I thought might be related to the latest Marathon update (~0.14.0) from the Mesosphere package repo, so I downgraded, which did not resolve the issue. Amidst my ham-fisting I managed to get Marathon into a state where it was unable to elect a leader. To resolve that, I removed the /marathon znode and its children, assuming that reconciliation would reconstruct the state; however, that did not occur. Does anyone know how to rebuild the ZooKeeper state for Marathon so that I can reclaim all of my tasks, which are still running on the Mesos agent servers (for now)? I'm afraid these tasks will fail and be unable to restart, which would be a very bad day indeed. Any help is greatly appreciated. If you have any questions, please let me know.


Thank you,

John

Brenden Matthews

Jan 25, 2016, 10:26:15 PM
to John Fanjoy, marathon-framework
Hey John,

The first thing I'd suggest you do is make a backup copy of all the ZK data.
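
A minimal sketch of that backup step, assuming the ZK data lives in a single directory on disk. The function name and both paths are hypothetical; take the real locations from the dataDir (and dataLogDir, if set) entries in your own zoo.cfg.

```python
import shutil
import time
from pathlib import Path


def backup_zk_data(data_dir, backup_root):
    """Copy a ZooKeeper data directory into a timestamped backup folder.

    Both arguments are placeholders for your own paths. Copy while the
    server is stopped, or from a node that is not serving traffic, so
    the files are not changing mid-copy.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such data directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates dest (including missing parents) and raises if
    # dest already exists, so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

For example, `backup_zk_data("/var/lib/zookeeper", "/var/backups/zookeeper")` would copy the whole tree, snapshots and transaction logs included, if that is where your installation keeps them.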

After that, you can try restoring a ZK cluster from previous snapshots, and see if the data is still there. There's no guarantee it will work, but it's probably your best shot at this point. I'd stand up a 1-node ZK cluster, check the result from each of the previous snapshots in turn, and repeat until you have something that resembles the previous good state.
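
To decide which snapshot to try first, it can help to list them newest first. The sketch below assumes the usual on-disk layout, where snapshots sit in a `version-2` subdirectory of dataDir and are named `snapshot.<zxid>` with a hexadecimal transaction id as the suffix. The function name is hypothetical, and the code only orders the files; it does not restore anything.

```python
from pathlib import Path


def list_snapshots(data_dir):
    """Return ZooKeeper snapshot files under data_dir, newest first.

    Ordering uses the hexadecimal zxid suffix of each file name, not
    the file modification time.
    """
    snap_dir = Path(data_dir) / "version-2"
    found = []
    for path in snap_dir.glob("snapshot.*"):
        suffix = path.name.split(".", 1)[1]
        try:
            zxid = int(suffix, 16)
        except ValueError:
            continue  # skip files that do not follow the naming scheme
        found.append((zxid, path))
    found.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in found]
```

Each candidate would then be copied, together with the transaction logs (`log.<zxid>`) from the same period, into the data directory of the throwaway 1-node cluster before starting it and inspecting the result.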

Does that make sense?

--
You received this message because you are subscribed to the Google Groups "marathon-framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marathon-framew...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John Fanjoy

Jan 25, 2016, 10:46:06 PM
to Brenden Matthews, marathon-framework

It does make sense. I've never had to restore ZooKeeper from a snapshot, but it seems fairly straightforward. I figured I would need the snapshots, so I stopped our cron task, which purges all but the most recent few, to make sure we don't lose anything we had there. Once I have a close state, is there a good way to dump the data from one cluster and load it into the other?

Brenden Matthews

Jan 25, 2016, 11:20:40 PM
to John Fanjoy, marathon-framework
You could do it 1 of 2 ways:

1. Restart the cluster with the good data, or
2. Boot up a new cluster with the data, and configure Marathon to use the new cluster

There does exist a ZK copy tool (https://github.com/kshchepanovskyi/zkcopy) which I've used in the past with success, but I'm not sure how well that will work in this case.

You should also check out the ZK docs on the snapshots, before trying to restore anything: https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_dataFileManagement
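
If a copy tool is not an option, the dump-and-load idea can be sketched as a recursive walk over the znode tree. This is an illustration under assumptions, not how zkcopy works: `copy_subtree` is a hypothetical helper, and `src` and `dst` are assumed to be connected clients with kazoo-style methods (`get`, `get_children`, `exists`, `create`), for example two `KazooClient` instances.

```python
def copy_subtree(src, dst, path):
    """Recursively copy the znode at path, and everything below it,
    from the src client to the dst client.

    Assumed client interface (kazoo-style): get(path) -> (data, stat),
    get_children(path) -> list of child names, exists(path) -> stat or
    None, create(path, data, makepath=True).
    """
    data, stat = src.get(path)
    # Ephemeral nodes belong to a live session (leader election entries,
    # for instance), so they are skipped rather than copied.
    if getattr(stat, "ephemeralOwner", 0):
        return
    # Nodes that already exist on the destination are left untouched.
    if dst.exists(path) is None:
        dst.create(path, data or b"", makepath=True)
    for child in src.get_children(path):
        child_path = path.rstrip("/") + "/" + child
        copy_subtree(src, dst, child_path)
```

A call such as `copy_subtree(src, dst, "/marathon")` would copy only the Marathon subtree. ACLs are not carried over in this sketch.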

John Fanjoy

Jan 26, 2016, 10:42:15 AM
to marathon-framework
Brenden,

Thank you for your help so far. I have recovered the state as it was last night, but now I am back to the point where Marathon seems to be unable to elect a leader. I can use zkCli to list all of my tasks, but the UI is not functional, and I imagine failures are still not going to be handled correctly. This issue occurred after downgrading to 0.11 and then upgrading to 0.13. I read somewhere about immutable znodes, but I don't know how to tell whether that is the problem, or the proper way to recover if it is.


- John

Jeremy Olexa

Jan 26, 2016, 12:26:43 PM
to marathon-framework

Hi John,


I would be careful bouncing between these versions. You may be hitting the breaking change in v0.13; it looks like it is not backwards compatible.


https://github.com/mesosphere/marathon/issues/2405 (Issue that some people had around v0.13)

https://github.com/mesosphere/marathon/releases/tag/v0.13.0 (Breaking Change release notes)


-Jeremy





Brenden Matthews

Jan 26, 2016, 1:11:24 PM
to John Fanjoy, marathon-framework
You can try to force a re-election by deleting the `/marathon/leader` node in ZK. I'm not sure if that will resolve your particular problem, however.
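
A minimal sketch of that deletion, assuming a connected client with kazoo-style `exists` and `delete` methods (for example a `KazooClient`). The helper name is hypothetical, and whether removing the node actually unblocks the election depends on what is causing the failure.

```python
def force_reelection(client, leader_path="/marathon/leader"):
    """Delete the leader znode to try to trigger a new election.

    client is assumed to offer exists(path) -> stat or None and
    delete(path, recursive=True). Returns True if a node was deleted,
    False if there was nothing to delete.
    """
    if client.exists(leader_path) is None:
        return False
    # recursive=True also removes any child znodes under the leader node.
    client.delete(leader_path, recursive=True)
    return True
```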


Aaron

Jan 26, 2016, 5:33:24 PM
to marathon-framework
I had a similar problem, and I don't recall exactly what I had to do to fix it.  But I think it was that my Marathon nodes had a newer ZK configuration than the ZK masters, so they wouldn't connect.  I had to delete the ZK metadata on the Marathon nodes and rejoin them to the ZK cluster.  My apps/tasks were not important to me, so I'm not sure what effect that may have on your running tasks.

John Fanjoy

Jan 28, 2016, 8:26:36 AM
to Aaron, marathon-framework

Thanks everyone for your help. I ended up restoring from the previous snapshot and then completely reinstalled Marathon, which resolved my leader election issue. It seems the downgrade affected the Java installation in some way that I missed originally.

