Hi Daniel,
What follows is just a (random) list of problems that I can think of. Most likely you know most of them already, and I don't think it is complete, but I hope it will be of some use.
The general idea is that you go through all instances one by one: deploy a new one, register it in the load balancer, unregister the old one from the load balancer, stop its internal processing, wait until it stops, and kill the old instance. Then do the same with the next one.
Load balancer
You need a cluster and a front-end load balancer of some sort. It can be hardware, it can be HAProxy or another Vert.x verticle, whatever works best for you. It can be many machines as well. This way you can bring new instances up on any port and register them in the load balancer. Then you unregister the old version, wait for it to finish all requests (this is easy with HTTP, as there will be no new requests, but for internal communication and internal processing you need a stop switch) and kill it. If you have a nicely homogeneous cluster, you can stop a machine first and then deploy the new version on it, so you don't have to double the ports or keep a spare machine for the rolling update. Be aware that some services need warm-up time and should be registered in the load balancer only after they are warm, so they do not slow down the cluster.
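Roughly, the loop described above could look like the sketch below. All the types (Instance, LoadBalancer, Deployer) are hypothetical placeholders I made up for illustration; they are not a Vert.x or HAProxy API, and a real tool would also need timeouts and error handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a rolling update; every type here is a made-up placeholder. */
final class RollingUpdate {

    /** Hypothetical handle to one running instance of the service. */
    interface Instance {
        String name();
        void warmUp();              // e.g. prime caches before taking traffic
        void stopAcceptingWork();   // the "stop switch" for internal processing
        boolean isIdle();           // true once in-flight work has drained
        void kill();
    }

    /** Hypothetical view of the front-end load balancer. */
    interface LoadBalancer {
        void register(Instance instance);
        void unregister(Instance instance);
    }

    /** Hypothetical step that deploys the new version to replace one old instance. */
    interface Deployer {
        Instance deployReplacementFor(Instance old);
    }

    /** Replaces the old instances one by one and returns the new ones. */
    static List<Instance> run(List<Instance> oldInstances, LoadBalancer lb, Deployer deployer) {
        List<Instance> replaced = new ArrayList<>();
        for (Instance old : oldInstances) {
            Instance fresh = deployer.deployReplacementFor(old);
            fresh.warmUp();             // register only after warm-up
            lb.register(fresh);
            lb.unregister(old);         // no new external requests from now on
            old.stopAcceptingWork();    // stop internal processing as well
            while (!old.isIdle()) {     // wait for in-flight work to finish
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while draining " + old.name(), e);
                }
            }
            old.kill();
            replaced.add(fresh);
        }
        return replaced;
    }
}
```

The order matters here: the new instance is registered before the old one is unregistered, so capacity never drops below the original level during the update.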
Versioning of API
Are you enriching an existing API or creating a new one? A new one is simple: just deploy it on a different path, with the version in the path.
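To make the "version in the path" idea concrete, here is a minimal sketch of picking a handler from the first path segment. The path shape ("/v2/orders/17") and the handlers are assumptions for illustration only; with Vert.x you would normally let the HTTP router do this for you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Sketch: choose a handler from the version segment of the path. Paths and handlers are made up. */
final class VersionedRouter {
    private final Map<String, Function<String, String>> handlersByVersion;

    VersionedRouter(Map<String, Function<String, String>> handlersByVersion) {
        this.handlersByVersion = new HashMap<>(handlersByVersion);
    }

    /** Expects paths shaped like "/v2/orders/17"; returns empty for unknown versions. */
    Optional<String> route(String path) {
        String[] parts = path.split("/", 3);    // "/v2/orders/17" -> ["", "v2", "orders/17"]
        if (parts.length < 3) {
            return Optional.empty();
        }
        Function<String, String> handler = handlersByVersion.get(parts[1]);
        if (handler == null) {
            return Optional.empty();
        }
        return Optional.of(handler.apply(parts[2]));
    }
}
```

Old and new clients never meet on the same path this way, so both versions can stay deployed for as long as needed.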
But if you are adding new features to an existing API, you have two verticles serving different things on exactly the same path, and you have limited control over who gets what. An old verticle can start a request, a new verticle can serve part of it, and it can finally end up on an old verticle again. Thus, if you are not versioning the external API (and are not able to route based on version), the changes need to be both backward and forward compatible.
In practice you will most likely get all 3 cases eventually:
1) bugfix updates - no change of API, no new fields, just different internal logic
2) small additions - new data that should be ignored but not lost by the old version
3) big changes - new versions
The hardest point IMHO is number 2.
You need to test API compatibility, always, before deploying. Could this addition break our users? What will happen if half of the flow runs on the new version and half on the old one?
Fortunately, when using schemaless stuff, life is easier here.
Versioning of internal messaging
In most cases you cannot force people to use a different version of your API for each change, but you can version the internal API much more finely. Then the new version (let's say 2.1.1234) makes all its calls to the new version's endpoints (internal REST calls, websockets, event bus, e.g. :8080/v2.1/service) and the old one runs on the old paths, so neither conflicts with the other.
In some cases (especially when we have persistence ;) such as JMS, a database, etc.) you might need the new version to also handle the old data / communication. The new version then also listens on v2.0 and modernizes the data when it finds it.
So we might need to support both the new and the old logic in a single version of the app, and then you need a way to handle that in your code.
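One way to handle that in code, as a sketch only: the new version registers its own versioned address and also an adapter on the old address that upgrades the message first. The toy in-memory bus below stands in for the event bus or JMS, and the addresses and the "name" to "firstName"/"lastName" change are assumptions I invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: versioned internal addresses on a toy in-memory bus (a stand-in for event bus, JMS, ...). */
final class VersionedBus {
    private final Map<String, Function<Map<String, Object>, String>> consumers = new HashMap<>();

    void consumer(String address, Function<Map<String, Object>, String> handler) {
        consumers.put(address, handler);
    }

    String request(String address, Map<String, Object> message) {
        Function<Map<String, Object>, String> handler = consumers.get(address);
        if (handler == null) {
            throw new IllegalArgumentException("no consumer on " + address);
        }
        return handler.apply(message);
    }

    /** What a hypothetical v2.1 node registers: its own address plus an adapter for v2.0 traffic. */
    static void registerV21(VersionedBus bus) {
        Function<Map<String, Object>, String> v21 =
                msg -> "hello " + msg.get("firstName") + " " + msg.get("lastName");
        bus.consumer("service.v2.1.greet", v21);
        // Old callers and old persisted messages still arrive here; upgrade them, then reuse the new logic.
        bus.consumer("service.v2.0.greet", old -> v21.apply(upgrade(old)));
    }

    /** Assumed format change: v2.0 sent a single "name", v2.1 expects "firstName" and "lastName". */
    static Map<String, Object> upgrade(Map<String, Object> old) {
        String[] parts = String.valueOf(old.get("name")).split(" ", 2);
        Map<String, Object> upgraded = new HashMap<>(old);
        upgraded.remove("name");
        upgraded.put("firstName", parts[0]);
        upgraded.put("lastName", parts.length > 1 ? parts[1] : "");
        return upgraded;
    }
}
```

The nice part of this shape is that the old logic lives only in upgrade(), so it is easy to delete once nothing sends v2.0 messages any more.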
Format of data / shared cluster data / hazelcast etc.
As said, option 2 (small additions) can be the most complex. If we are talking about a small change that should not "change" the major version of the API, we need to be sure that the new / different data we send over REST, JMS, event bus, memory grid or database will not kill the old client. And if both versions use the same data, we need to be sure that no data will be lost (when the old version reads new data and saves it back). In such a case you need schemaless / non-validated files, or schemas that allow all the tags/fields they don't know.
There are two main issues here:
1) you try to parse/validate the data and your client / old version of the app throws an exception
2) you try to read and write a new entity with the old version and you lose data.
1) is simple to solve: just allow unknown fields in your input data. If you have completely different formats, you need to version them.
For 2), some cases are simpler than others. The idea is that you use the object in its native form if you can (JSON, XML), or, if you need to parse / reformat it (e.g. map it to a Java class), keep the parts that you do not understand in a byte array (Protobuf, Kryo, Hazelcast, Coherence). This might be supported by your mapping technology, e.g.:
https://blogs.oracle.com/felcey/entry/coherence_rolling_upgrades
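If your mapping technology does not do it for you, the same idea can be sketched by hand: the old version maps the fields it knows and carries everything else along untouched. The OldUser class and the "name" field below are made up for illustration, and I use a plain map instead of a byte array to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: an "old version" entity that keeps the fields it does not understand. Field names are made up. */
final class OldUser {
    final String name;                            // the only field this version knows
    private final Map<String, Object> unknown;    // everything else, preserved as-is

    private OldUser(String name, Map<String, Object> unknown) {
        this.name = name;
        this.unknown = unknown;
    }

    /** Reads a record: no exception on unknown fields (issue 1), and they are kept (issue 2). */
    static OldUser read(Map<String, Object> raw) {
        Map<String, Object> rest = new LinkedHashMap<>(raw);
        Object name = rest.remove("name");
        return new OldUser(String.valueOf(name), rest);
    }

    OldUser withName(String newName) {
        return new OldUser(newName, unknown);
    }

    /** Writes the record back, including fields added by a newer version. */
    Map<String, Object> write() {
        Map<String, Object> out = new LinkedHashMap<>(unknown);
        out.put("name", name);
        return out;
    }
}
```

So if a newer version has added, say, an "email" field, a read / modify / write cycle on an old node saves the record back with that field still in place.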
When is the new version deployed?
To finish the deployment, all old nodes must be killed and any data migration finished. In most cases you want to run the data migration after the last old-version node has been killed, not sooner.
So here are some questions:
- do you serve the new logic as soon as the new version has one (2, 3) instances, or do you think the old version will break something?
- do you need to wait for both (no old version running and data migration finished) before switching from the old logic to the new logic?
- if you have to modernize data, can you do that lazily instead of in a batch, or both? (then you may need to keep the migration logic "forever")
You may want to do more than one deployment to deploy a new version. E.g. start with a version that supports both old and new data, migrate, clean up (remove columns in the database), then deploy a simpler version that does not have the old code.
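For the lazy option, a minimal sketch: each record carries a schema version and is upgraded the first time the new code reads it. The store, the "schemaVersion" field and the price-to-cents change are all assumptions for the example, and it presumes no old-version node is still writing the old format.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of lazy migration: records carry a schema version and are upgraded when first read. */
final class LazyMigratingStore {
    static final int CURRENT_VERSION = 2;
    private final Map<String, Map<String, Object>> rows = new HashMap<>();   // stand-in for a database

    void putRaw(String id, Map<String, Object> row) {
        rows.put(id, new HashMap<>(row));
    }

    /** Returns the record in the current format, migrating and saving it back if it is old. */
    Map<String, Object> get(String id) {
        Map<String, Object> row = rows.get(id);
        if (row == null) {
            return null;
        }
        if (versionOf(row) < CURRENT_VERSION) {
            row = migrateV1toV2(row);
            rows.put(id, row);
        }
        return new HashMap<>(row);
    }

    /** Number of records still in an old format; once it is zero the migration code can go. */
    long pendingMigrations() {
        return rows.values().stream().filter(r -> versionOf(r) < CURRENT_VERSION).count();
    }

    /** Records without a version marker are treated as version 1. */
    private static int versionOf(Map<String, Object> row) {
        return (Integer) row.getOrDefault("schemaVersion", 1);
    }

    /** Assumed change: v1 stored "price" in whole units, v2 stores "priceCents". */
    private static Map<String, Object> migrateV1toV2(Map<String, Object> v1) {
        Map<String, Object> v2 = new HashMap<>(v1);
        Object price = v2.remove("price");
        if (price != null) {
            v2.put("priceCents", ((Integer) price) * 100);
        }
        v2.put("schemaVersion", CURRENT_VERSION);
        return v2;
    }
}
```

A batch job can run the same migrateV1toV2 over whatever pendingMigrations() still counts, which is the "both" variant: lazy first, batch to finish, then the cleanup deployment.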
Regards,
Krzysztof Kowalczyk