Eureka cluster in weird state after dynamically disabling self preservation mode

1,299 views

Panagiotis Partheniadis

15 Apr 2014, 10:55:30 AM
to eureka_...@googlegroups.com
Hello,

We have 2 Eureka instances that form a cluster in the Amazon cloud, one instance per AZ. Recently we had to shut down all the registered Service instances for maintenance, but failed to stop the Eureka instances too. As a result, the Eureka instances went into self-preservation mode, which, as expected, really messed up our registries. When we got wind of the problem, we dynamically disabled the mode by setting eureka.enableSelfPreservation=false, without shutting down and restarting the Eureka instances. The mode was indeed disabled, and the obsolete registrations immediately vanished. However, we ended up in a weird state: each Eureka instance now maintains only the Service registrations from its own AZ, not from both, and the two servers no longer seem to synchronise their contents. If we register a new Service with 2 instances (one per AZ), both instances do get registered correctly, but each one only in the Eureka server of its own AZ, and the contents are never synchronised between the servers. Does this feel like a bug?
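For reference, a runtime flip like the one described above is typically done through Archaius dynamic configuration on Eureka 1.x servers. The snippet below is only an illustrative sketch; how the property actually reached the running servers in this thread is not stated.

    import com.netflix.config.ConfigurationManager;

    public class DisableSelfPreservation {
        public static void main(String[] args) {
            // eureka.enableSelfPreservation is read by the server as a dynamic property,
            // so updating the in-memory Archaius configuration takes effect without a restart.
            ConfigurationManager.getConfigInstance()
                    .setProperty("eureka.enableSelfPreservation", "false");
        }
    }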

Nitesh

17 Apr 2014, 3:41:40 AM
to eureka_...@googlegroups.com
Yes, it does seem like a bug. By design, selfPreservationMode=false must not stop the replication between the instances.
Do you see any errors in the logs?

Panagiotis Partheniadis

17 Apr 2014, 9:36:01 AM
to eureka_...@googlegroups.com
Nope. No errors in the logs in the 2 days since the incident. Do you want me to search for ERRORs at the time we applied the dynamic change to eureka.enableSelfPreservation=false?

The only thing I see in the logs, and it is very odd, is the following pair of lines:

23:59:06.103 INFO  DiscoveryClient_CacheRefresher   - Finished a call to service url http://localhost:8010/eureka/v2/ and url path apps/delta with status code 200.

23:59:06.104 INFO  DiscoveryClient_CacheRefresher   - Completed cace refresh task for discovery. All Apps hash code is Local region apps hashcode: OUT_OF_SERVICE_1_STARTING_10_UP_39_, is fetching remote regions? false 

The "OUT_OF_SERVICE_1_STARTING_10_UP_39_" is probably the app hash code BEFORE the incident. Currently, the app hash code reported by both instances is "UP_21_" (which should have been UP_42_ if synchronisation worked as expected) . So, this log entry is completely off...

Panagiotis Partheniadis

17 Apr 2014, 9:55:22 AM
to eureka_...@googlegroups.com
Also, no errors right after the incident. I got a bunch of entries like:

14:13:52.783 WARN  Eureka-EvictionTimer   - DS: Registry: expired lease for HERMES - i-b2235dee 

for all the instances that were expired and then a bunch of entries like:

14:13:52.829 INFO  qtp2102078799-114409 - DELETE /eureka/v2/apps/HERMES/i-b2235dee   - Not Found (Cancel): HERMES - i-b2235dee

14:13:52.829 WARN  batcher.localhost-Cancel-process   - PeerEurekaNode: http://localhost:8010/eureka/v2/apps/: HERMES/i-b2235dee : delete: missing entry.

again for all the instances that were expired.

Nitesh Kant

21 Apr 2014, 2:47:04 AM
to eureka_...@googlegroups.com
These expiries were on the peer node, the one that does not get the instance registrations directly?
Yeah, it looks like the servers are not replicating.
Does this issue go away after a restart of the Eureka servers?
Can you provide more concrete reproducible steps so that I can debug this further?



Panagiotis Partheniadis

21 Apr 2014, 11:38:30 AM
to eureka_...@googlegroups.com
The steps that led to this problem are:
1). We shut down all the Service instances registered in Eureka. As enableSelfPreservation was true, both Eureka server instances (one in each AZ) went into self-preservation mode.
2). For some time, various Services registered pairs of instances (one per AZ). As preferSameZone=true, each instance registered itself with the Eureka server located in the same AZ.
3). For some time, various un-registration requests happened. As the Eureka server instances were in self-preservation mode, these never took effect.
4). We dynamically changed eureka.enableSelfPreservation=false for the Eureka server instance in the first AZ.
5). We dynamically changed eureka.enableSelfPreservation=false for the Eureka server instance in the second AZ.

That's it.

Do you have any insight into what we can do to remedy the problem with minimum impact on our environment? This is not a production environment but a staging one; still, it would be nice to have a proper resolution strategy that doesn't mess up the system.
E.g. what should we expect to happen if we restarted only one of the two Eureka server instances? Should we first switch enableSelfPreservation back to true, so that the Eureka clients of the registered Services do not clean up their local caches because of this restart? What can we expect out of this move? Do both Eureka instances get in sync? Or does only the restarted one get the full info, so we then need to do the same with the other one?

hy...@netflix.com

22 Apr 2014, 6:11:12 PM
to eureka_...@googlegroups.com
Panagiotis,

I'm looking into this right now. Will update the thread once I come up with something.

Nitesh Kant

22 Apr 2014, 7:24:54 PM
to eureka_...@googlegroups.com
Answering your question about the resolution of this problem.

Firstly, you should re-set the property <prefix>.enableSelfPreservation to true if all your applications are back to normal, i.e. correctly publishing their status.

Now, if you restart one of the Eureka servers, its clients will start connecting to the other Eureka server to send heartbeats (which will in turn trigger a re-register if that Eureka server does not know about the app instance), and hence the other server will learn about those applications. When the restarted server comes back, the clients should revert to using it. If you repeat this for all the servers as a rolling restart, your Eureka cluster should come back to normal. This works because no state is persisted by the Eureka server itself; all data is in-memory.
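To illustrate the mechanism Nitesh describes, here is a minimal sketch (not the DiscoveryClient's actual code): the client keeps sending renewal heartbeats, and a 404 from a server that does not know the instance is what triggers a full re-registration. The server URL is a placeholder; the app and instance id are borrowed from the earlier log lines.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeartbeatSketch {
        public static void main(String[] args) throws Exception {
            String serviceUrl = "http://eureka-az1.example.com:8010/eureka/v2/"; // placeholder
            String app = "HERMES";
            String instanceId = "i-b2235dee";

            // Renewal (heartbeat): PUT /eureka/v2/apps/{app}/{instanceId}
            URL renewUrl = new URL(serviceUrl + "apps/" + app + "/" + instanceId);
            HttpURLConnection conn = (HttpURLConnection) renewUrl.openConnection();
            conn.setRequestMethod("PUT");
            int status = conn.getResponseCode();
            conn.disconnect();

            if (status == 404) {
                // The real client reacts to a 404 renewal by re-registering, i.e. sending
                // its full InstanceInfo back to /eureka/v2/apps/{app}; that is what
                // re-populates a freshly restarted server's registry.
                System.out.println("Renewal rejected; client would re-register " + app + "/" + instanceId);
            } else {
                System.out.println("Renewal answered with HTTP " + status);
            }
        }
    }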

There is a slight caveat: during the restart of a server, the applications start connecting to the other Eureka server, and since the instances from the other AZ are not available on that server, these clients will not see them. However, if you have redundancy for the same application across two AZs you should be fine, as you will at least have one instance of the same app in both Eureka servers.

hy...@netflix.com

22 Apr 2014, 7:29:40 PM
to eureka_...@googlegroups.com
Panagiotis,

Based on what I can see from design and code, self preservation shouldn't prevent anything from being replicated. Just want to clarify something before I set out to reproduce it myself. 

2). For some time, various Services registered pairs of instances (one per AZ). As preferSameZone=true, each instance registered itself with the Eureka server located in the same AZ.
3). For some time, various un-registration requests happened. As the Eureka server instances were in self-preservation mode, these never took effect.

Did both the register and un-register happen before you set the enableSelfPreservation flag to false, or after? I assumed they were after, but the steps indicate otherwise?
For 3), self preservation shouldn't prevent an explicit un-register from happening. The only thing the server does is protect itself by not evicting expired leases. Can you elaborate on your statement 3)?

Panagiotis Partheniadis

23 Apr 2014, 6:12:05 AM
to eureka_...@googlegroups.com
Hello,

Statements 2) and 3) refer to before we set enableSelfPreservation to false. I'm just describing what was happening for days: we ended up with instances that had been removed from the cloud but were still reported in Eureka, and with instances that were registered in only one of the two Eureka servers.

Indeed, we have seen cases where a service instance was explicitly shut down and still was not removed from the registry. I know this is not supposed to happen, but still... I know I'm probably not helping much, but it was a mess and we tried to fix it somehow before we had time to investigate for hours. Imagine that we ended up with new Services reported as available on instances that were previously occupied by other Services! This happened because we reused the old IPs in the VPC for new Services, and those did not manage to overwrite the old info in the Eureka registry. So we had IP X with Service A before, and afterwards IP X with Service B installed, but Eureka still reported the old info. We ended up handing out completely wrong IPs for service discovery.

Panagiotis Partheniadis

23 Apr 2014, 6:32:11 AM
to eureka_...@googlegroups.com
Now, trying to follow Nitesh's instructions on how to resolve the sync problem:

I did not manage to achieve complete success.
1). I first changed enableSelfPreservation back to true in both Eureka instances.
2). I stopped one of the two Eureka instances.
3). Indeed, after a while, all clients reported to the other instance.
4). I restarted the stopped instance.
5). ONLY the service instances in the same AZ switched back to the newly started Eureka instance.
6). Final view: the restarted instance's registry contains only the registrations of the services in the same AZ. The other instance does contain everything.

So:
- Is this the expected behaviour? One instance's registry is a superset of the other's.
- If I change the status of a service instance that is reported by both registries, I do get the change reflected in both, so sync seems to work. But if I change the status of a service instance that is reported only by the registry that contains the superset, that is apparently the only place where I see the change.
- Another problem I hit: when I restarted the Eureka instance, the newly reported instances show up as STARTING. This, along with the fact that this registry ends up holding only the registrations of the same AZ, probably messes up the Eureka clients that prefer this AZ and report that they cannot find a valid instance during discovery.

Panagiotis Partheniadis

23 Apr 2014, 7:19:04 AM
to eureka_...@googlegroups.com
Another thing:

I continuously get errors like this now:

08:33:46.813 ERROR qtp1851524641-98 - PUT /eureka/v2/apps/CONTRACTSDS/i-527b3f0e?status=STARTING&lastDirtyTimestamp=1398204537170   - The lease expiration has been disabled since the number of renewals per minute   is lower than the minimum threshold. Number of Renewals Last Minute : 112. The Threshold is 0.85 of total instances : 0

for all of my service registrations.
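For context on that message: the threshold is derived from how many registered instances the server expects heartbeats from. The arithmetic below is a sketch assuming Eureka's defaults (clients renew every 30 seconds, renewal-percent threshold 0.85); a threshold of 0 suggests the server's expected-renewal count was never rebuilt, which is consistent with the values in the log line.

    public class RenewalThresholdSketch {
        public static void main(String[] args) {
            // Values taken from the error message above, plus Eureka's documented defaults.
            int knownInstances = 0;                 // "total instances : 0" in the log
            double renewalPercentThreshold = 0.85;  // eureka.renewalPercentThreshold default
            int renewsLastMinute = 112;             // "Number of Renewals Last Minute : 112"

            // Each client heartbeats every 30s by default, i.e. 2 renewals per minute.
            int expectedRenewsPerMin = knownInstances * 2;
            int threshold = (int) (expectedRenewsPerMin * renewalPercentThreshold);

            // Lease expiration is only enabled while the threshold is positive and renewals
            // stay above it; with a threshold of 0 the server keeps expiration disabled,
            // which is what the error message reports.
            boolean leaseExpirationEnabled = threshold > 0 && renewsLastMinute > threshold;
            System.out.println("threshold=" + threshold
                    + ", leaseExpirationEnabled=" + leaseExpirationEnabled);
        }
    }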

Howard Yuan

23 Apr 2014, 11:47:56 AM
to eureka_...@googlegroups.com
Thanks for the information. I think your situation is very likely related to some environment issue other than the enableSelfPreservation flag. Let me absorb your info and try to make sense of it.


--
You received this message because you are subscribed to a topic in the Google Groups "eureka_netflix" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/eureka_netflix/LZCkrbalYJU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to eureka_netfli...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Thanks,

Howard

Panagiotis Partheniadis

23 Apr 2014, 2:35:03 PM
to eureka_...@googlegroups.com
After some digging, we found a bug in the process that shuts down our stacks, which prevented our Service instances from shutting down gracefully and therefore from explicitly un-registering from the Eureka server. So, you can scratch the following:

3). For some time, various un-registration requests happened. As the Eureka server instances were in self-preservation mode, these never took effect.

hy...@netflix.com

23 Apr 2014, 5:33:44 PM
to eureka_...@googlegroups.com
There's quite a lot of info you've listed here. Can you do me a favor: reboot one of the two Eureka servers again, do a few registrations in each zone, and confirm that the new registrations only show up in their own zone's Eureka server, but not the other zone's? (The assumption here is that the non-rebooted server will have a whole view, while the newly booted server will only have a subset view of its own zone.)

Once you confirm that, can you send me the logs from both servers? You can send them to me separately at hy...@netflix.com. If you can clean up both logs just prior to the reboot, that'll be great.

Also, what's your Eureka version? Just want to make sure it's not one of the older versions that have some known bugs.

Just to give you some idea of where I'm going:
1. I hope to eliminate the possibility that you're using too early a version, so I'm not chasing down the wrong path.
2. Replication itself seems to be happening, but there are some basic checks we put in when we merge the replication results. Maybe those are preventing the info from one zone's server from showing up on the other zone's server?
3. All your instance info showing only "STARTING" is a bit weird. Do you have a healthcheck function registered?

Anyways, I have quite a few questions in mind so I want to see the raw logs for myself.

Panagiotis Partheniadis

30 Apr 2014, 9:33:45 AM
to eureka_...@googlegroups.com
We checked what you said about the version, and indeed we are using a rather old one. So, before we get into more research, we decided to upgrade to the latest one, which for us is 1.1.126. The process we are following is this:

We currently have 1+1 instances in 2 AZs. We decided to go for 2+2 after the upgrade. So, first I added a new instance running the new version in AZ1. Then, I replaced the old instance with a new one in the same AZ1. After that, with 2 instances of the latest version in the same AZ1, we don't even get synced data between these 2! I don't know whether the interaction with the instance in the other AZ2 is the cause of this, but we will see what happens when we make the switch for the other AZ too. Meanwhile, I'm sending you the first 2 log files: the first is for the instance that was added as new to the AZ1 stack and the other is for the instance that replaced the old one in the same AZ1 stack. We will do the same with AZ2 and let you know.

Panagiotis Partheniadis

30 Apr 2014, 10:10:21 AM
to eureka_...@googlegroups.com
Another thing that may be problematic is the way we set up the cluster. We use the following configuration:

       <region>us-west-1</region>
        <preferSameZone>true</preferSameZone>
        <shouldUseDns>false</shouldUseDns>
        <waitTimeInMsWhenSyncEmpty>0</waitTimeInMsWhenSyncEmpty>
        <datacenter>cloud</datacenter>
        <us-west-1>
            <availabilityZones>us-west-1a,us-west-1c</availabilityZones>
        </us-west-1>
        <serviceUrl>
            <us-west-1a>{{us-west-1a-eureka-instance}}/eureka/v2/</us-west-1a>
            <us-west-1c>{{us-west-1c-eureka-instance}}/eureka/v2/</us-west-1c>
        </serviceUrl>

So, we maintain 2 ASGs and create 2 Eureka stacks: one specifically for us-west-1a and one for us-west-1c. In the aforementioned configuration, {{us-west-1a-eureka-instance}} is the ELB address of the 1a ASG and {{us-west-1c-eureka-instance}} is the ELB address of the 1c ASG. This is how we avoid putting specific IPs in the configuration and also avoid using EIPs. Do you think this is a problematic setup?

Panagiotis Partheniadis

1 May 2014, 4:44:48 AM
to eureka_...@googlegroups.com
So, after I pushed the new version to the other ASG stack and tried to continue with the 2+2 setup, nothing seemed to be in sync, not even the 2 instances in the same AZ. Apparently, the problem is the setup we use and already mentioned: in our configuration we use the ELB URL that the 2 instances in the same AZ are registered behind, and not 2 distinct URLs, one for each instance. I can understand that it is impossible for these 2 instances to sync properly that way. So, I removed 1 instance from each AZ, went back to the 1+1 setup (where the usage of the ELB URL is OK, as there is only 1 instance behind it) and voila! Everything worked like a charm! The remaining 1+1 instances managed to get in sync properly. I also added a new stack and changed its status using the REST endpoint of one of the instances, and both got in sync.

So, this whole mess was probably due to the old version I was using. To be sure, of course, I need to see what happens when self-preservation mode kicks in.

Also, I would like your opinion on the dual-ASG setup that we are using. It seems to work OK when we have 1+1 instances; it of course does not work if we go to 2+2, 3+3, etc. Are there any shortcomings we may run into following this approach, assuming that 1+1 is OK for us?

Howard Yuan

1 May 2014, 12:47:33 PM
to eureka_...@googlegroups.com
Good to see that you're making progress. A couple of things come to mind:
1. 1+1 or 2+2 or 1+2 or any other configuration you feel provides sufficient redundancy is fine, as long as each server has its own serviceUrl (see the sketch below). If multiple servers share the same URL, things won't work properly.
2. One of the reasons we came up with Eureka was to have one less dependency, namely ELB. Since each server needs its own address/URL anyway, one Eureka server behind one ELB seems unnecessary and may do more harm than good.
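To illustrate point 1: each zone's serviceUrl can list every server in that zone individually, comma-separated, instead of hiding them behind one ELB. The sketch assumes the standard eureka.serviceUrl.<zone> client property keys, which is presumably what the <serviceUrl> block in the earlier XML configuration resolves to; the host names are placeholders.

    import com.netflix.config.ConfigurationManager;

    public class PerServerServiceUrls {
        public static void main(String[] args) {
            // One entry per Eureka server, per zone; clients and peer servers can then
            // address each server individually instead of going through a shared ELB.
            ConfigurationManager.getConfigInstance().setProperty(
                    "eureka.serviceUrl.us-west-1a",
                    "http://eureka-1a-1.example.com:8010/eureka/v2/,"
                            + "http://eureka-1a-2.example.com:8010/eureka/v2/");
            ConfigurationManager.getConfigInstance().setProperty(
                    "eureka.serviceUrl.us-west-1c",
                    "http://eureka-1c-1.example.com:8010/eureka/v2/,"
                            + "http://eureka-1c-2.example.com:8010/eureka/v2/");
        }
    }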

Panagiotis Partheniadis

5 May 2014, 2:00:06 AM
to eureka_...@googlegroups.com
Yesterday, one of the two ASGs failed a health check and decided to recycle its Eureka server instance. Remember that we are talking about a 1+1 setup using 1+1 ASGs. Everything worked OK and, after the recycling, the 2 Eureka servers were in sync. But the problem is that nearly all registrations changed status from UP to STARTING. This happened to Service instances from both AZs, but I also had a few that did not switch to STARTING at all. Can we identify the problem here? I guess the Eureka clients will pick up the new statuses and eventually report that no instances are available to be discovered, which is a big problem as you can imagine.

Panagiotis Partheniadis

5 May 2014, 2:08:07 AM
to eureka_...@googlegroups.com

Correction: as it happened, BOTH ASGs decided to recycle at the same time. So, there was probably some period during which no Eureka servers were active. Still, when the clients re-connected to the new instances, why do they report as STARTING? The client should have the proper status, no?

Howard Yuan

6 May 2014, 1:25:52 AM
to eureka_...@googlegroups.com
I don't know for sure what happened here, but very likely your clients didn't register or update with Eureka with the UP status.

So now the problem is that the instances' status is "STARTING" instead of "UP". Let's try to focus on that. If you can isolate the issue, have only one new instance up and go through the server logs to see what the server is saying. I would guess the status the Eureka server gets is always "STARTING". The Eureka client doesn't know whether your service is really UP or not (although there's a config flag to register the service as "UP" at startup time, it might then start taking traffic too early). Your application has to tell the Eureka server that your service is "UP" and ready for traffic.

Panagiotis Partheniadis

6 May 2014, 2:51:56 AM
to eureka_...@googlegroups.com
Yes, our Services indeed set the initial status to STARTING, and we set it to UP using the REST API endpoint through a custom console we maintain. The reason for this is the one you described about not taking traffic immediately. So, after some time, we manually set the status to UP. Then one of the Eureka server instances is recycled. What happens when it comes back up? Do the Services try to register anew with the new Eureka server instance? If they do, I can understand why they register as STARTING. But here we also have a sync process with the other Eureka instance, where all instances have a status of UP. Does the sync process happen before or after the registration of the new instances, or at the same time? And what is the outcome of the sync then? The same Service registration is reported as UP in the Eureka instance that did not recycle and as STARTING in the recycled one. I have the feeling that the latter, for some reason, wins, and we end up with all registrations in all Eureka instances switching to STARTING. In my opinion, this should not be happening, because then the feature of registering a Service initially as STARTING is essentially problematic and should never be used.
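(For reference, the manual flip to UP described here presumably maps onto Eureka's REST status-override operation; the sketch below is illustrative only, with a placeholder server URL and the app/instance id borrowed from an earlier log line.)

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class StatusOverrideSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder values; a console would substitute its own server URL,
            // application name and instance id.
            String serviceUrl = "http://eureka-az1.example.com:8010/eureka/v2/";
            String app = "CONTRACTSDS";
            String instanceId = "i-527b3f0e";

            // PUT .../apps/{app}/{instance}/status?value=UP sets a status override on the
            // server side, flipping a registration from STARTING to UP without redeploying.
            URL url = new URL(serviceUrl + "apps/" + app + "/" + instanceId + "/status?value=UP");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            System.out.println("Status override returned HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }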

Howard Yuan

6 May 2014, 12:10:50 PM
to eureka_...@googlegroups.com
I can't be 100% sure of the whole picture just based on emails. A few things come to mind:
1. I remember reading that both servers happened to reboot? If so, the manual UP status will be lost.
2. But I think the bigger point of all this is that your app has to be able to register/update with the proper status based on its own state. Manual setting should only be used for deployment/debugging; normal operations should all be based on the app's logic. Usually people do a few things to make this happen: in the startup process, once the app has fully come up, update Eureka with UP; in a shutdown hook, once the shutdown process starts, update Eureka with DOWN. Also, you can register an app healthcheck with Eureka, so every time the healthcheck goes bad, Eureka will know and mark the instance as DOWN. Since all of these are initiated from the application instances themselves, the instances will maintain the proper status and the heartbeat updates will carry the right status. So even if both servers restart, the applications' statuses will be captured again soon (a rough sketch follows below).
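A minimal sketch of that lifecycle with the Eureka 1.x client API (illustrative only; the instance/client config classes and the trivial health check are placeholder assumptions, and EC2 deployments may use a cloud-specific instance config instead):

    import com.netflix.appinfo.ApplicationInfoManager;
    import com.netflix.appinfo.HealthCheckCallback;
    import com.netflix.appinfo.InstanceInfo.InstanceStatus;
    import com.netflix.appinfo.MyDataCenterInstanceConfig;
    import com.netflix.discovery.DefaultEurekaClientConfig;
    import com.netflix.discovery.DiscoveryManager;

    public class LifecycleSketch {
        public static void main(String[] args) {
            // Register with Eureka; the instance starts out as STARTING, as the
            // services in this thread do.
            DiscoveryManager.getInstance().initComponent(
                    new MyDataCenterInstanceConfig(), new DefaultEurekaClientConfig());

            // Let Eureka poll the application's own health instead of relying on a
            // manual status override.
            DiscoveryManager.getInstance().getDiscoveryClient()
                    .registerHealthCheckCallback(new HealthCheckCallback() {
                        @Override
                        public boolean isHealthy() {
                            return true; // replace with a real readiness check
                        }
                    });

            // Once startup has fully completed, the application itself flips to UP;
            // every subsequent heartbeat carries that status to whichever server it reaches.
            ApplicationInfoManager.getInstance().setInstanceStatus(InstanceStatus.UP);

            // On shutdown, mark the instance DOWN and unregister so servers drop it cleanly.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                ApplicationInfoManager.getInstance().setInstanceStatus(InstanceStatus.DOWN);
                DiscoveryManager.getInstance().shutdownComponent();
            }));
        }
    }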