rebuilding all cluster nodes

Alexey Popov

Oct 15, 2013, 7:11:15 AM
to project-...@googlegroups.com
Hi All,

I'm running a cluster of 32 boxes. Due to an ongoing infrastructure transformation I'm going to have to rebuild all of them on a different Linux version and also change hostnames and IP addresses. I'm trying to do it in the least risky way. Please take a look at my plan and let me know if it has a chance of success:

1) Rebuild all nodes one after another, preserving all configs (with the old hostnames) and the data directory. Keep the old IP addresses on secondary interfaces.
2) Shut down all nodes, update cluster.xml with the new hostnames and start everything up. Also probably restart all clients.

I'm worried that at step 2 the cluster may, for some reason, go crazy with the new cluster.xml. What do you think?

Thanks,
Alex

Vin C

Oct 16, 2013, 3:16:38 AM
to project-...@googlegroups.com
Brendan Harris (Voldemort SRE at LinkedIn) has dealt with things like this a lot. I would wait a day or two for him before taking a stab at this.

At a high level, it seems like you are willing to tolerate downtime. If that's the case, I think the plan will work, as long as you bounce all clients after the new cluster is built out. (A zero-downtime solution should also work; let me know if you need one. We may have to check a few corner cases.)

Alexey Popov

Oct 16, 2013, 10:49:40 AM
to project-...@googlegroups.com
Thanks Vin,

We can tolerate some downtime; I'm just trying to make sure I don't kill anything irreversibly. I've heard that Voldemort tracks cluster.xml changes, but I'm not sure what for, or what it does when it detects a config change. So I'm worried that once I update cluster.xml on all nodes in the final step and start them up, it might go terribly wrong.

Out of interest, what's the zero-downtime solution, briefly? There's very little info online on managing Voldemort.

Vin C

Oct 16, 2013, 2:44:11 PM
to project-...@googlegroups.com
Okay. The cluster.xml version tracking is simply for the purposes of the ZenStoreClient (a subclass of DefaultStoreClient). ZenStoreClient periodically pulls down the cluster.xml (and stores.xml) from the servers and checks whether the client has an outdated version of the metadata. This is so that the clients pick up metadata changes in a bounded amount of time. So even if you start afresh and delete the version information, your client bounce will get you back live and kicking. Just make sure you don't change the cluster layout (in terms of number of servers and partitions) and you should be fine.
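
For reference, this is roughly how a client gets built. The store name and bootstrap URL below are made up, and whether you actually get a ZenStoreClient under the hood depends on your Voldemort version and client config, so treat it as a sketch:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;

    public class ClientBootstrapSketch {
        public static void main(String[] args) {
            // hypothetical bootstrap URL; point it at any server's socket port
            ClientConfig config = new ClientConfig()
                    .setBootstrapUrls("tcp://node01.example.com:6666");
            StoreClientFactory factory = new SocketStoreClientFactory(config);

            // the factory bootstraps cluster.xml/stores.xml from the server;
            // the client then re-checks the metadata version periodically
            StoreClient<String, String> client = factory.getStoreClient("test-store");
            client.put("hello", "world");
        }
    }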

The zero-downtime way to do this is a series of host swaps. Basically, we bring a server down, update cluster.xml with the new hostname, let the clients pick it up, and bring the new server back online. (Slops should take care of data written for that server in the meantime, since it's all node-id based.) Hope it helps.
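
For context, each node's entry in cluster.xml looks roughly like this (the ports and partition numbers here are made up). In a host swap only <host> changes; <id> and <partitions> stay exactly the same:

    <cluster>
      <name>prod-cluster</name>
      <server>
        <id>0</id>
        <host>new-host-01.example.com</host>  <!-- the only field that changes in a swap -->
        <http-port>8081</http-port>
        <socket-port>6666</socket-port>
        <admin-port>6667</admin-port>
        <partitions>0, 32, 64</partitions>    <!-- must stay exactly the same -->
      </server>
      <!-- ...one <server> element per node, 32 in your case... -->
    </cluster>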

Alex P

Oct 17, 2013, 5:48:40 AM
to project-...@googlegroups.com
Thanks Vin C,

Very helpful!

Yep, I'm just changing the Linux version and IPs/hostnames, but I'm preserving cluster.xml (except for the hostnames) and all other configs and data.

In the zero-downtime scenario, when I swap a host, do I need to update cluster.xml across the whole cluster, restarting all nodes? Or will the clients (using the config version tracking) always use the most recent cluster.xml? I've heard cluster.xml must be the same on all cluster nodes.

Alex P

Oct 24, 2013, 5:22:07 AM
to project-...@googlegroups.com
Hi Vin C,

Could you explain a bit more about host swaps? I rebuild a host, update cluster.xml everywhere, restart all nodes and all clients, and then repeat with all the other nodes one after another, right?

Thanks


Justin Mason

Oct 24, 2013, 5:43:26 AM
to project-voldemort
hi Alex -- what I've done in the past is basically this:

- for each node in the cluster:
  - rsync the data from oldnode to newnode (using "ionice -c3" to avoid a major impact on operation latencies)
  - take down oldnode
  - rsync the deltas from oldnode to newnode
  - update cluster.xml with newnode's name and the same node id
  - start up newnode
  - use voldemort-admin-tool to push that new cluster.xml to all the other server nodes
  - if you are using ZenStoreClient, all clients should pick that up within a minute or two; otherwise you need to restart the client fleet -- which was pretty inconvenient, so yay for ZenStoreClient ;)
- iterate.
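
Roughly, one iteration of that loop looks like the sketch below. The hostnames, data paths and ports are made up, the start/stop commands depend on how you run the server on your boxes, and the admin-tool flags are from memory, so double-check everything against bin/voldemort-admin-tool.sh --help before relying on it:

    # 1. warm copy while oldnode is still serving, at idle I/O priority
    ssh oldnode01 'ionice -c3 rsync -a /path/to/voldemort/data/ newnode01:/path/to/voldemort/data/'

    # 2. stop the old node (or however you normally stop voldemort-server on your boxes)
    ssh oldnode01 'bin/voldemort-stop.sh'

    # 3. copy only the deltas written since the warm copy
    ssh oldnode01 'ionice -c3 rsync -a /path/to/voldemort/data/ newnode01:/path/to/voldemort/data/'

    # 4. edit cluster.xml: keep the node's <id> and <partitions>, change only <host>,
    #    then start the new node with that config
    ssh newnode01 'nohup bin/voldemort-server.sh /path/to/voldemort/config > server.log 2>&1 &'

    # 5. push the updated cluster.xml to the rest of the servers
    #    (flag names from memory -- verify them on your build)
    bin/voldemort-admin-tool.sh --url tcp://newnode01:6666 \
        --set-metadata cluster.xml --set-metadata-value /path/to/voldemort/config/cluster.xml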




Alex P

Oct 24, 2013, 6:03:46 AM
to project-...@googlegroups.com, j...@jmason.org
Got it, thanks Justin.

Rather than doing 32 host swaps in my case, do you think it's possible to swap more nodes at once? For example, take down 8 nodes whose partitions are replicated elsewhere, start up 8 new nodes with the new cluster.xml, use the voldemort-admin tool to propagate cluster.xml, and possibly restart the clients?

Justin Mason

Oct 24, 2013, 8:42:38 AM
to project-voldemort

On Thu, Oct 24, 2013 at 11:03 AM, Alex P <vta...@gmail.com> wrote:

Rather than doing 32 host swaps in my case, do you think it's possible to swap more nodes at once? For example, take down 8 nodes whose partitions are replicated elsewhere, start up 8 new nodes with the new cluster.xml, use the voldemort-admin tool to propagate cluster.xml, and possibly restart the clients?

As I understand it, you run the risk of data loss in this scenario, unless you can turn off reads and writes to the cluster during this period. Doing it a node at a time lets the built-in replication provide enough replicas to still satisfy reads and writes. If you take down too many nodes at once, there's a chance that too many replicas of a piece of data sit on downed nodes, so that data becomes unavailable because the required-reads or required-writes quorum can't be reached.
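
To make that concrete, the quorum knobs live per store in stores.xml, something like the made-up example below. With replication-factor 2 and required-reads/required-writes of 1, any key whose two replica partitions both land on downed nodes can satisfy neither quorum until one of those nodes comes back:

    <stores>
      <store>
        <name>example-store</name>                      <!-- made-up store -->
        <persistence>bdb</persistence>
        <routing>client</routing>
        <replication-factor>2</replication-factor>      <!-- 2 copies of every key -->
        <required-reads>1</required-reads>              <!-- read quorum -->
        <required-writes>1</required-writes>            <!-- write quorum -->
        <key-serializer><type>string</type></key-serializer>
        <value-serializer><type>string</type></value-serializer>
      </store>
    </stores>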

--j.