Hello,
We've had a replication problem with Eureka over the last few months, and we've run out of ideas to work around it. We're hoping for some advice.
Our environment
All of our services register with Eureka, and we run three Eureka servers in production. Our two primary applications, PRIMARY_APP and ADMIN_APP, live side by side on each of 60 servers, which are Amazon instances. The two applications listen on different ports, and both register themselves with Eureka.
We deploy multiple times every day, and when we do, we update both applications on a server at the same time. Here's the process:
- Pick half of the servers and update the Eureka metadata on them: takingTraffic=false.
- Routers read this metadata and remove all traffic from them.
- Stop the applications, update them, and warm them up.
- Each updated application re-registers itself with Eureka.
- When all of the first half have been updated, update the metadata again to put traffic back.
- Repeat the process for the second half.
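To make the steps above concrete, here is a rough sketch of the drain step, written against Eureka's REST operation for updating instance metadata (a PUT to `.../apps/{appID}/{instanceID}/metadata?key=value`). It is illustrative only: the base URL is a placeholder, the helper names are hypothetical, and it assumes each instance ID is simply the server's hostname, which may not match a given setup.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL; the real path depends on how Eureka is deployed
# (for example /eureka/v2 or /eureka).
EUREKA_BASE = "http://eureka.example.internal/eureka/v2"


def split_in_halves(servers):
    """Return (first_half, second_half); the first half gets the extra server if the count is odd."""
    middle = (len(servers) + 1) // 2
    return servers[:middle], servers[middle:]


def metadata_update_url(app_id, instance_id, taking_traffic):
    """Build the URL for a PUT that sets takingTraffic on one registered instance."""
    query = urlencode({"takingTraffic": "true" if taking_traffic else "false"})
    app = quote(app_id, safe="")
    instance = quote(instance_id, safe="")
    return f"{EUREKA_BASE}/apps/{app}/{instance}/metadata?{query}"


def drain_urls(servers, apps=("PRIMARY_APP", "ADMIN_APP")):
    """URLs to PUT so that routers stop sending traffic to both applications on each server.

    Assumes the instance ID equals the server name (hypothetical).
    """
    return [metadata_update_url(app, server, False) for server in servers for app in apps]
```

Putting traffic back is the same call with `takingTraffic=true`, issued once every server in the half has re-registered.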
Our problem
The problem is that these metadata updates often fail to propagate correctly to the other Eureka servers. Right after the update, the metadata looks consistent, but shortly afterward the takingTraffic property reverts to its previous value for some of the application servers. It doesn't happen all at once, and a given deployment may see none, some, or all of the servers affected.
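One way to pin down which instances have reverted is to compare what each Eureka server reports for the same instance. The sketch below assumes the registry contents have already been fetched from each Eureka server and reduced to plain dictionaries; the function name and the input shape are hypothetical, not part of Eureka.

```python
def find_divergent_instances(snapshots, key="takingTraffic"):
    """Find instances whose metadata value differs between Eureka servers.

    snapshots: {eureka_server: {instance_id: metadata_dict}}
    Returns {instance_id: {eureka_server: value}} for every instance where the
    servers disagree. An instance or key missing on a server is reported as None,
    so a missing registration also counts as a disagreement.
    """
    instance_ids = set()
    for instances in snapshots.values():
        instance_ids.update(instances)

    divergent = {}
    for instance_id in sorted(instance_ids):
        values = {
            server: instances.get(instance_id, {}).get(key)
            for server, instances in snapshots.items()
        }
        if len(set(values.values())) > 1:
            divergent[instance_id] = values
    return divergent
```

Running this repeatedly in the seconds after an update would show when a value flips back and on which Eureka server it flips first.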
After examining the replication code, we suspected that a metadata update alone is not enough to synchronize the Eureka servers correctly, so we tried setting the status to OUT_OF_SERVICE during an update instead. Unfortunately, we saw the same behavior, even with just four servers. At first, all four had the correct status, but within a few seconds one of them reverted to UP.
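For concreteness, a status change of this kind can be expressed against Eureka's REST operations roughly as follows: a PUT to `.../status?value=OUT_OF_SERVICE` sets the override, and a DELETE on the same resource removes it. This is only a sketch; the base URL is a placeholder, the helper names are hypothetical, and it builds the requests without sending them.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL; adjust to the actual Eureka deployment.
EUREKA_BASE = "http://eureka.example.internal/eureka/v2"


def _status_url(app_id, instance_id, status):
    """Build the status resource URL for one registered instance."""
    app = quote(app_id, safe="")
    instance = quote(instance_id, safe="")
    return f"{EUREKA_BASE}/apps/{app}/{instance}/status?{urlencode({'value': status})}"


def take_out_of_service(app_id, instance_id):
    """Return (method, url) for marking an instance OUT_OF_SERVICE."""
    return "PUT", _status_url(app_id, instance_id, "OUT_OF_SERVICE")


def remove_status_override(app_id, instance_id, fallback_status="UP"):
    """Return (method, url) for removing the override, suggesting a fallback status."""
    return "DELETE", _status_url(app_id, instance_id, fallback_status)
```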
Does anyone have any suggestions or insight for us?