Ubuntu 14 to 16 upgrade

Marco Shaw

Jan 29, 2019, 8:13:19 AM
to riak-...@googlegroups.com
**RIAK novice**

I need to upgrade the base OS for a 5-node cluster.

My plan was to do one at a time:
0. (Maybe) disable RIAK from startup
1. Shut node down
2. Take a VMware snapshot
3. Restart the node
4. Run the Ubuntu upgrade process
5. Restart the node
6. Test
7. (Maybe) re-enable RIAK at startup
8. Move on to the others
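
The plan above can be sketched as a dry-run loop that handles one node at a time and stops as soon as a node fails its test, so the remaining nodes are left alone. This is only an illustration: `rolling_upgrade`, `run_step`, and `node_passes_test` are hypothetical names, and the steps are plain descriptions rather than real commands.

```python
from typing import Callable, List, Tuple

# Steps 0-5 of the plan; step 6 (test) is the gate, step 7 follows it.
PRE_TEST_STEPS = [
    "disable Riak at startup (optional)",
    "shut node down",
    "take a VMware snapshot",
    "restart the node",
    "run the Ubuntu upgrade process",
    "restart the node",
]
POST_TEST_STEPS = ["re-enable Riak at startup (optional)"]


def rolling_upgrade(
    nodes: List[str],
    run_step: Callable[[str, str], None],
    node_passes_test: Callable[[str], bool],
) -> List[str]:
    """Upgrade one node at a time; raise at the first node that fails its test."""
    upgraded = []
    for node in nodes:
        for step in PRE_TEST_STEPS:
            run_step(node, step)
        if not node_passes_test(node):
            raise RuntimeError(f"{node} failed its test; remaining nodes untouched")
        for step in POST_TEST_STEPS:
            run_step(node, step)
        upgraded.append(node)
    return upgraded


# Dry run: record the steps instead of executing anything.
log: List[Tuple[str, str]] = []
result = rolling_upgrade(
    ["node5", "node4"],
    run_step=lambda node, step: log.append((node, step)),
    node_passes_test=lambda node: True,
)
```

In a real run, `node_passes_test` would be whatever check you trust for step 6; the sketch only fixes the ordering.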

Let's say my first test node, Node5, fails at step 6.  Maybe RIAK fails to "fully sync".  Once that node is finally restarted, I'm assuming it will fully sync with the others?

What happens if all of a sudden after a first sync, I revert to the VMware snapshot because there's some kind of failure/issue?  Will that node resync with the others *again*?

I know that using VMware snapshots can be a no-no in various circumstances, but could the above plan have any issues?

Martin Sumner

Jan 29, 2019, 11:20:06 AM
to Marco Shaw, riak-...@googlegroups.com
Marco,

If we're talking about nodes in the same cluster, then there's no direct full-sync process.  There are three independent processes that should bring a node back into sync:

Handoff.  When the node you're updating goes down, fallback vnodes on other nodes soon start receiving the PUTs that would have gone to the downed vnodes (as well as read repairs for all the keys subject to GET requests whilst the vnodes were down).  When the node comes back up, the handoff process should return all these "missing" PUTs to the recovered vnodes.  However, this is a one-off process: once the handoff is complete, if you subsequently wind the vnode back to its previous state, the handoff will not re-occur.
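
To illustrate why handoff is a one-off, here is a toy model in Python. It is not how Riak implements handoff: `ToyCluster` and its fields are hypothetical, there is a single primary and a single fallback, and the snapshot is just a copy of a dictionary.

```python
import copy


class ToyCluster:
    """One primary vnode plus one fallback vnode, modelled as dictionaries."""

    def __init__(self):
        self.primary = {}       # data on the node being upgraded
        self.fallback = {}      # writes held elsewhere while the primary is down
        self.primary_up = True

    def put(self, key, value):
        # While the primary is down, writes land on the fallback instead.
        target = self.primary if self.primary_up else self.fallback
        target[key] = value

    def handoff(self):
        # One-off transfer: once delivered, the fallback no longer holds the data.
        self.primary.update(self.fallback)
        self.fallback.clear()


cluster = ToyCluster()
cluster.put("a", 1)

cluster.primary_up = False
snapshot = copy.deepcopy(cluster.primary)   # snapshot taken while the node is down
cluster.put("b", 2)                         # stored on the fallback

cluster.primary_up = True
cluster.handoff()
after_handoff = dict(cluster.primary)       # contains both "a" and "b"

cluster.primary = copy.deepcopy(snapshot)   # revert to the snapshot
cluster.handoff()                           # nothing left to hand off
after_revert = dict(cluster.primary)        # "b" is missing again
```

After the revert, the key written during the outage is gone from the primary and the fallback has nothing more to send, which is the gap that read repair or Active Anti-Entropy then has to close.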

Read repair: If data is missing, and you request that data, the requests should still work (if r < n), but as part of collating the reads the fact that one node has a missing update will be detected, and the missing data will be filled in.  As long as you read your keys, eventually all missing data will be repaired even if handoff is incomplete or doesn't happen or there is a reversion post-handoff.
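
A minimal sketch of the read repair idea, under heavy simplification: replicas are plain dictionaries, there are no object versions or siblings, and a replica only counts towards `r` if it actually holds the key. The function name `get_with_read_repair` is made up for this example.

```python
from typing import Dict, List, Optional


def get_with_read_repair(
    replicas: List[Dict[str, str]], key: str, r: int
) -> Optional[str]:
    """Return the value if at least r replicas hold it, filling in any that do not."""
    found = [replica[key] for replica in replicas if key in replica]
    if len(found) < r:
        return None  # not enough replicas answered with a value
    value = found[0]  # toy model: every copy that exists is assumed identical
    for replica in replicas:
        if key not in replica:
            replica[key] = value  # repair the replica that missed the update
    return value


# n = 3 replicas; the third one was reverted and lost the key.
replicas = [{"k": "v"}, {"k": "v"}, {}]
value = get_with_read_repair(replicas, "k", r=2)
```

The read succeeds because two of the three replicas still have the key, and as a side effect the third replica is brought back in line.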

Active Anti-Entropy (needs to be turned to active):  This will continually monitor for gaps in your data between vnodes, and then actively repair them.  This will repair regardless of the success of handoff, or the reading of keys.


Note, although I described these processes as independent, they are subtly inter-linked.  Active Anti-Entropy only detects gaps, and triggers read repair to fill them in.  Handoff doesn't just capture objects PUT during the outage; it also picks up objects that are read (GET) during the outage, via the resulting read repairs.
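
The split between detecting gaps and repairing them can be sketched as follows. This is a toy comparison of per-key hashes between two replicas, not Riak's actual AAE implementation, and all function names here are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def key_hashes(replica: Dict[str, str]) -> Dict[str, str]:
    """Hash every value so replicas can be compared without shipping the data."""
    return {k: hashlib.sha256(v.encode()).hexdigest() for k, v in replica.items()}


def find_gaps(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Detection only: list the keys whose hashes differ or exist on one side only."""
    ha, hb = key_hashes(a), key_hashes(b)
    return sorted(k for k in ha.keys() | hb.keys() if ha.get(k) != hb.get(k))


def anti_entropy_pass(
    a: Dict[str, str], b: Dict[str, str], read_repair: Callable[[str], None]
) -> List[str]:
    """Detect gaps, then delegate the actual fix to read repair."""
    gaps = find_gaps(a, b)
    for key in gaps:
        read_repair(key)
    return gaps


healthy = {"a": "1", "b": "2"}
reverted = {"a": "1"}            # lost "b" when the snapshot was restored


def repair(key: str) -> None:
    # Toy repair: copy whichever value exists to both replicas.
    value = healthy.get(key, reverted.get(key))
    healthy[key] = value
    reverted[key] = value


first_pass = anti_entropy_pass(healthy, reverted, repair)
second_pass = anti_entropy_pass(healthy, reverted, repair)
```

The first pass finds and repairs the missing key without anyone having read it; the second pass finds nothing left to do.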

So the short answer is yes, eventually everything should get back in sync, even if you end up reverting your snapshot.   To ensure this process happens in a defined period (rather than "eventually") you will need Active Anti-Entropy configured.

Regards

Martin

Marco Shaw

Jan 29, 2019, 12:16:58 PM
to riak-...@googlegroups.com
Thank you for the details!

AAE is enabled.  I've not been able to find anything (quickly) regarding "configuring AAE".  I did notice that when I tried to recover a node recently, AAE took "some time", but once it started I didn't keep an eye on it to see how quickly it ran.  I did check roughly 24 hours later and it appeared to be all clear.

Martin Sumner

Jan 29, 2019, 12:23:50 PM
to Marco Shaw, riak-...@googlegroups.com
Sorry, I meant 'enabled' when I said 'configured', i.e. you shouldn't need to do anything else.  Generally AAE repairs should happen within an hour.  The only exception is when the AAE store is out-of-sync with the main KV store, but unless you have corruption on disk this shouldn't be an issue.
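
For anyone wanting to confirm that AAE is enabled before starting, a small sketch follows. It assumes the node uses the `riak.conf` format and that the relevant setting is `anti_entropy = active`, as in Riak KV 2.x; check your own version's documentation. The helper `read_setting` and the sample text are made up for illustration.

```python
from typing import Optional


def read_setting(conf_text: str, name: str) -> Optional[str]:
    """Return the value of `name` from key = value text, or None if it is absent."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()  # this sketch lets the last occurrence win
    return value


# Hypothetical excerpt; on a real node you would read the file from disk instead.
sample_conf = """
## Enable active anti-entropy
anti_entropy = active
storage_backend = bitcask
"""

aae_mode = read_setting(sample_conf, "anti_entropy")
aae_enabled = aae_mode == "active"
```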


