Replication inaccuracies


TysonS

Apr 18, 2014, 5:34:52 PM
to eureka_...@googlegroups.com
Hello,

We've had a replication problem with Eureka over the last few months, and we've run out of ideas to work around it. We're hoping for some advice.

Our environment
Our services all register with Eureka, and we have three Eureka servers in production. We run two primary applications side-by-side on each server. There are 60 of these servers, all Amazon instances. Each one runs PRIMARY_APP and ADMIN_APP on different ports, and both applications register themselves with Eureka.

We deploy multiple times every day, and when we do, we update both applications on a server at the same time. Here's the process:
  1. Pick half of the servers and update the Eureka metadata on them: takingTraffic=false.
  2. Routers read this metadata and remove all traffic from them.
  3. Stop the applications, update them, and warm them up. 
  4. Each updated application re-registers itself with Eureka.
  5. When all of the first half have been updated, update the metadata again to put traffic back.
  6. Repeat the process for the second half.
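For illustration, step 1 might look like the following sketch, assuming Eureka's v2 REST metadata endpoint. The host, app, and instance names are placeholders, and the function prints the request instead of sending it (swap the echo for the commented-out curl to make it live):

```shell
#!/bin/sh
# Hypothetical sketch of step 1: flip the takingTraffic flag through
# Eureka's metadata REST endpoint. Host, app ID, and instance ID are
# placeholders for your environment.
eureka_set_metadata() {
  host="$1"; app="$2"; instance="$3"; key="$4"; value="$5"
  url="http://${host}/eureka/v2/apps/${app}/${instance}/metadata?${key}=${value}"
  # Dry-run: print the request instead of sending it.
  echo "PUT ${url}"
  # To actually send it:
  # curl -f -X PUT "${url}"
}

eureka_set_metadata eureka1.example.com:8080 PRIMARY_APP i-0123456 takingTraffic false
```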
Our problem
The problem we've been having is that much of the time, these metadata updates don't propagate to the other Eureka servers properly. Right after the update, the metadata looks consistent, but then shortly thereafter, the takingTraffic property in the metadata reverts for some servers. It doesn't happen all at once, and it may affect none, some, or all of the servers. 

After some examination of the replication code, we thought that simply updating the metadata would be insufficient to correctly synchronize between the servers, so we attempted to use the status OUT_OF_SERVICE during an update instead. Unfortunately, we witnessed the same behavior, even with just 4 servers. At first, all four had the correct status, but in just a few seconds, one of them reverted back to UP. 

Does anyone have any suggestions or insight for us?

Nitesh Kant

Apr 21, 2014, 3:00:36 AM
to eureka_...@googlegroups.com
How do you set the status OUT_OF_SERVICE for the instance? Do you directly make an update on the server (by calling the status update API: PUT /eureka/v2/apps/appID/instanceID/status?value=OUT_OF_SERVICE), or do you update the status locally on the application?
Inside Netflix, we generally set an instance OUT_OF_SERVICE by updating the discovery server directly, via Asgard.
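As a sketch, the server-side status update described above, and the corresponding removal of the override, could look like this. The host, app, and instance names are placeholders, and the functions print the requests instead of sending them (swap echo for the commented-out curl to make them live):

```shell
#!/bin/sh
# Hypothetical sketch of setting and clearing a status override through
# Eureka's v2 REST API. Host, app ID, and instance ID are placeholders.
eureka_set_status() {
  # Force the instance's status on the server, e.g. OUT_OF_SERVICE.
  echo "PUT http://$1/eureka/v2/apps/$2/$3/status?value=$4"
  # curl -f -X PUT "http://$1/eureka/v2/apps/$2/$3/status?value=$4"
}
eureka_clear_status() {
  # Remove the override so the instance's own reported status applies again.
  echo "DELETE http://$1/eureka/v2/apps/$2/$3/status"
  # curl -f -X DELETE "http://$1/eureka/v2/apps/$2/$3/status"
}

eureka_set_status eureka1.example.com:8080 PRIMARY_APP i-0123456 OUT_OF_SERVICE
eureka_clear_status eureka1.example.com:8080 PRIMARY_APP i-0123456
```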

I can see some issues with setting the instance status locally and letting the client take care of updating the status on the server.




TysonS

Apr 21, 2014, 11:59:50 AM
to eureka_...@googlegroups.com
We PUT the status to Eureka from a separate application (not the applications being updated). The applications are still sending heartbeats at that time, and only stop sending heartbeats when we take them offline for the update.

hy...@netflix.com

Apr 21, 2014, 12:49:41 PM
to eureka_...@googlegroups.com
@TysonS,

Right after the update, the metadata looks consistent, but then shortly thereafter, the takingTraffic property in the metadata reverts for some servers. It doesn't happen all at once, and it may affect none, some, or all of the servers. 

I assume that when you say "takingTraffic property in the metadata reverts for some servers", you mean the property reverted from "false" to "true". Do you register a healthcheck function with the Eureka client to report your app's health? If so, it sounds like a bug that we fixed in the latest Eureka release. The bug was a race condition at shutdown time: Eureka's health check thread would re-register with the Eureka server, and when that happens, the metadata you applied gets "reverted".

Can you please try the latest Eureka release and see if that solves your issue?

Nitesh Kant

Apr 21, 2014, 1:01:58 PM
to eureka_...@googlegroups.com
The bug @hyuan is mentioning is this one: https://github.com/Netflix/eureka/issues/98, and the fix is available in this release: https://github.com/Netflix/eureka/releases/tag/1.1.128

TysonS

Apr 21, 2014, 3:56:07 PM
to eureka_...@googlegroups.com
Thanks @Nitesh and @hyuan. We'll grab the latest release and give it a try. I'll report back when we've got it live. 

BTW, is there a way to know which release we're running currently?

Nitesh Kant

Apr 21, 2014, 4:21:07 PM
to eureka_...@googlegroups.com
@TysonS these changes primarily affect the clients, so you would have to update your applications to the latest Eureka release.

For the Eureka server, the Eureka console provides the AMI id, if that is helpful.

On the application end, if you use Karyon, it has an admin module that lists the jars present on the classpath; that would give you what you need.