Rolling upgrade 1.9.4 to 3.1.5 on Amazon AWS

Matt

May 14, 2014, 6:59:14 AM

to haze...@googlegroups.com

Hi,

We're attempting to do a "rolling upgrade" whereby an old set of servers is replaced with an upgraded set of servers without downtime. The new servers include a Hazelcast upgrade from 1.9.4 to 3.1.5. We can't change the group name (for reasons I won't go into at this point) so the plan was that we would change the password so that the new servers form an operational cluster of their own, before we switch off the old ones.

For example:

- Existing servers A and B are running 1.9.4 with Hz password "goodbye"

- New server C is created that has 3.1.5 and is configured with Hz password "hello"

- New server D is created with 3.1.5 and password "hello"

- The new servers should form a new cluster, separate from the old machines

- Load balancing etc. is switched to point to the new servers

- Servers A and B are destroyed

My experiments with Hazelcast suggest that the same group name but different passwords will allow two valid and separate clusters to form.

We are, however, getting these warnings on the new servers:

2014-05-14 10:52:46,579 WARN  [com.hazelcast.nio.ReadHandler] [A.A.A.A]:5802 [my_group_name] hz._hzInstance_1_my_group_name.IO.thread-in-1 Closing socket to endpoint null, Cause:java.lang.IllegalArgumentException: Packet versions are not matching! This -> 1, Incoming -> 0
java.lang.IllegalArgumentException: Packet versions are not matching! This -> 1, Incoming -> 0
	at com.hazelcast.nio.Packet.readFrom(Packet.java:113)
	at com.hazelcast.nio.SocketPacketReader$DefaultPacketReader.readPacket(SocketPacketReader.java:67)
	at com.hazelcast.nio.SocketPacketReader.read(SocketPacketReader.java:49)
	at com.hazelcast.nio.ReadHandler.handle(ReadHandler.java:70)
	at com.hazelcast.nio.InSelectorImpl.handleSelectionKey(InSelectorImpl.java:33)
	at com.hazelcast.nio.AbstractIOSelector.run(AbstractIOSelector.java:124)

Where A.A.A.A is the IP of a new machine.

I wouldn't have expected this problem since the remote (old) server has not joined the cluster - I would have expected invalid password log messages instead.

Is this anything to worry about? Perhaps the packet handling routines are versioned even for cluster joins?

Thanks very much for your help!

Matt

Noctarius

May 14, 2014, 7:55:36 AM

to haze...@googlegroups.com

Hi Matt,

I strongly suggest not doing this!

Versions 1 and 3 are totally incompatible, so even the login packet (as you mentioned below) is not recognized as a login packet. All kinds of unexpected behavior can happen, and since you cannot move data from 1.x to 3.x anyway, I recommend separating the networks using either iptables or VLANs (or another suitable network technique). Do not try to run both versions under the same cluster name; it might work or it might not, and you won't know which.
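
To illustrate why the log shows a packet version error rather than an invalid-password message, here is a minimal sketch of a reader that checks a version byte before it parses anything else. It is not Hazelcast's actual code: the class name, the method, and the one-byte header layout are assumptions made for illustration, and only the exception message is taken from the log above.

```java
import java.nio.ByteBuffer;

public class PacketVersionSketch {
    // Assumed wire-format version of the local node (the log reports "This -> 1").
    public static final byte LOCAL_VERSION = 1;

    /**
     * Hypothetical reader: the first byte is treated as the packet version.
     * A mismatch is rejected before any payload, such as a join request
     * carrying the group password, is ever looked at.
     */
    public static byte[] readPayload(ByteBuffer in) {
        byte incoming = in.get();
        if (incoming != LOCAL_VERSION) {
            throw new IllegalArgumentException(
                    "Packet versions are not matching! This -> " + LOCAL_VERSION
                            + ", Incoming -> " + incoming);
        }
        // Versions match: hand the remaining bytes to the next stage.
        byte[] payload = new byte[in.remaining()];
        in.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        // Bytes as they might arrive from a node speaking a different format.
        ByteBuffer fromOldNode = ByteBuffer.wrap(new byte[] {0, 42, 42});
        try {
            readPayload(fromOldNode);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected before any password check: " + e.getMessage());
        }
    }
}
```

In a design like this the password comparison is never reached, which would explain the absence of invalid-password messages. It says nothing about how the bytes that follow a mismatched header are handled, which is where the unpredictable behavior comes from.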

Chris

Matt

May 14, 2014, 8:22:17 AM

to haze...@googlegroups.com, noctar...@googlemail.com

Thanks for your speedy reply! Could you give an example of the problems we might see? Keep in mind that we won't be running both clusters operationally for any significant time, since the old servers will be destroyed. Also, if the members cannot log in, then I would hope they can't do too much damage (otherwise, wouldn't that mean that any random network traffic could potentially cause problems?).

Hopefully by understanding what we're up against we can make an informed decision on what to do.

Thanks again

Matt

Noctarius

May 14, 2014, 8:30:19 AM

to Matt, haze...@googlegroups.com

I’m not sure what kinds of problems might occur. As an example from 3.1 to 3.2 (this was actually a bug and shouldn’t have happened, but you never know what will happen ;-)): when a 3.1 client tried to connect to a 3.2 cluster (where the internals had been rewritten from blocking IO to NIO), the 3.2 cluster member just stopped responding to any client requests.

Since we never tested your scenario, and probably never will, I can’t point you to specific danger areas. The internals are completely different, so you should expect anything to happen: packets that look like something else entirely and cause buffer underflows or overflows, illegal access exceptions (on sun.misc.Unsafe), or whatever else you can imagine.
