What's the right resync period value for informers?

Timo Reimann

Sep 7, 2017, 5:24:32 AM
to K8s API Machinery SIG
Hello,

I have a question regarding building controllers with the help of the informer framework (using Go, if that matters). Specifically, every informer must be given a resync period that defines how frequently all watched objects get revisited by means of a full list. To my understanding, this is meant to guard against transient errors on the watch part of the ListWatch interface (though I'm not entirely clear whether that covers client-side errors, server-side errors, or both).
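
For concreteness, here is a minimal sketch (not Traefik's actual code) of how the resync period gets wired in with client-go's shared informer factory; the kubeconfig path is a placeholder, and passing 0 instead of a duration disables resyncs entirely:

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Placeholder kubeconfig path; in-cluster config works just as well.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // The second argument is the resync period: on that interval the informer
    // replays its cached objects through the registered handlers as Update
    // notifications. 0 disables resyncs.
    factory := informers.NewSharedInformerFactory(client, 30*time.Minute)

    informer := factory.Core().V1().Endpoints().Informer()
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* enqueue obj */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue; resyncs arrive here too */ },
        DeleteFunc: func(obj interface{}) { /* enqueue obj */ },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)

    // Block forever; a real controller would tie this to signal handling.
    select {}
}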

Recommendations found in the wild as to the right resync period seem to gravitate towards "keep it rather high to avoid burdening the API server too much", and I often see values on the order of multiple minutes or even hours. However, this advice seems to render the resync period unsuitable as a means to compensate for watch errors in somewhat latency-critical controllers.

To add a real-world example: I am working on the Traefik Ingress controller, which watches over a user-provided list of namespaces. For each namespace, Traefik retrieves Ingress, Service, and Endpoints objects to build up the routing table. Missing one of these objects means we cannot create a proper forwarding rule for the affected application(s) until the situation is remedied. So my tendency is to pick a rather low resync period (preferably no more than a minute) to make sure the routing gets fixed quickly, but that runs contrary to the common advice.

I guess my questions boil down to the following:

1. Am I missing the point of the resync period somehow?
2. Are there any guidelines on how to choose the right value?
3. Is there maybe a way to measure the performance impact it has on the API server?

Thanks in advance for any advice, suggestions, and help in general.

Timo

David Eads

Sep 7, 2017, 8:08:48 AM
to Timo Reimann, K8s API Machinery SIG
The resync period does more to compensate for problems in the controller code than for missed watch events. We have had very few "I missed an event" bugs (I can think of one in recent memory), but we have had many controllers which use the resync as a big "try again" button.

A resync is different from a relist. The resync plays back all the events held in the informer cache. A relist hits the API server to re-fetch all the data.

Since we introduced the rate-limited work queue a few releases ago, the need to wait for a resync to retry has largely disappeared: an error during processing gets the item requeued with an incrementing delay.

Think of the resync as insurance. You probably want it set to more than a few minutes and less than a few hours. If you're using requeuing correctly and avoiding panics, you aren't likely to benefit from rescanning all the data very often.
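
That requeue-on-error pattern looks roughly like the sketch below with client-go's workqueue package; processItem is a hypothetical stand-in for the controller's real per-object sync logic.

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

// processItem stands in for the controller's real per-object sync logic.
func processItem(key string) error {
    fmt.Println("syncing", key)
    return nil
}

func main() {
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    defer queue.ShutDown()

    queue.Add("default/my-ingress")

    for {
        item, shutdown := queue.Get()
        if shutdown {
            return
        }
        err := processItem(item.(string))
        queue.Done(item)
        if err != nil {
            // Failed items come back with exponential backoff, so the
            // controller does not need a short resync period just to retry.
            queue.AddRateLimited(item)
            continue
        }
        // Success: clear the item's backoff history.
        queue.Forget(item)
    }
}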

Daniel Smith

Sep 7, 2017, 12:44:33 PM
to David Eads, Timo Reimann, K8s API Machinery SIG
When
* you are writing a controller which interacts both with Kubernetes and some other outside system (like, say, some cloud's load balancer API), and
* the outside system doesn't offer a watchable interface, so
* you intend to poll the outside system (to make sure it stays correct / repair it),
then it makes sense to use resync to trigger the polling behavior, and the resync period can be chosen based on how often you want to reconcile.

...you just can't forget to also have a system that periodically finds all the objects you've created in the past but that now need to be deleted (i.e., don't leak).
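
A sketch of what that can look like: during a resync the informer re-delivers each cached object as an Update in which old and new carry the same ResourceVersion, so a handler can tell resyncs apart from real changes and use them to drive the external poll. syncExternalLoadBalancer and enqueue below are hypothetical stand-ins.

package controller

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// Hypothetical stand-ins for the controller's real logic.
func syncExternalLoadBalancer(svc *corev1.Service) {}
func enqueue(svc *corev1.Service)                  {}

// addHandlers registers handlers so that resyncs drive the periodic poll of
// the outside system, while real changes go through the normal queue.
func addHandlers(informer cache.SharedIndexInformer) {
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            oldSvc := oldObj.(*corev1.Service)
            newSvc := newObj.(*corev1.Service)
            if oldSvc.ResourceVersion == newSvc.ResourceVersion {
                // A resync replays the cached object, so old and new are
                // identical: treat it as the "poll the outside system" tick.
                syncExternalLoadBalancer(newSvc)
                return
            }
            enqueue(newSvc)
        },
    })
}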

Timo Reimann

Sep 8, 2017, 11:32:35 AM
to K8s API Machinery SIG
Thanks for the feedback; that was really helpful. In particular, I wasn't aware that resyncs and relists were two separate things -- great to know.

One follow-up question: if I'm seeing this correctly, the Kubernetes scheduler runs with its resync period disabled. Under what circumstances would you not want that kind of "insurance" (as David put it)?


Daniel Smith

Sep 8, 2017, 12:33:07 PM
to Timo Reimann, K8s API Machinery SIG
I believe we have quashed the bugs and insurance is not necessary or desirable any more.

I think there is value in tests to make sure the relisting code gets exercised, since networks do occasionally drop connections at inconvenient times.

David Eads

Sep 8, 2017, 12:35:06 PM
to Daniel Smith, Timo Reimann, K8s API Machinery SIG
We've still been catching bugs in this space, but they've been mostly in the controllers.  The PV controller relying on resync comes to mind.  The insurance is largely for the controller code, not so much the list/watch (though we did find a bug in the current release there too). 

Daniel Smith

Sep 8, 2017, 12:39:38 PM
to David Eads, Timo Reimann, K8s API Machinery SIG
Right, and I think the scheduler was pretty solid on this front last time I looked at it, which was quite a while ago. But yeah, if the question was more general: periodic resync should guard against controller bugs (but controller authors should probably treat anything that turns up as a bug and fix it).

Timo Reimann

Sep 10, 2017, 4:21:53 PM
to K8s API Machinery SIG
On Friday, September 8, 2017 at 6:33:07 PM UTC+2, lavalamp wrote:
I think there is value in tests to make sure the relisting code gets exercised, since networks do occasionally drop connections at inconvenient times.

Not exactly sure I'm understanding this point: Does it mean that I (as the controller implementor) am responsible for issuing a relist if I detect a connection issue with my client? In particular, does that hold for client-go?

 
Daniel Smith

Sep 11, 2017, 12:29:05 PM
to Timo Reimann, K8s API Machinery SIG
If you use the standard tools, the relist will happen automatically when it is necessary; but since that is relatively rare, it's good to make sure it gets exercised in tests, in case it hits a corner case in your controller code.

Maxim Ivanov

Dec 9, 2018, 8:49:24 AM
to K8s API Machinery SIG
With standard SharedInformers, do relists happen at all? I can't find any List calls in the Reflector's code apart from the initial call when the reflector is set up. In other words, if the resync is an insurance for the controller's code, what is the currently recommended way to insure against missed Watch events?

Clayton Coleman

Dec 9, 2018, 9:31:31 AM
to Maxim Ivanov, K8s API Machinery SIG
Regarding missed watch events, there should be no such thing. If watch events are broken, all of Kubernetes is horribly broken (it would be like the database returning incorrect transactional results). We do occasionally have bugs at higher levels that present this way, but they are typically very obvious. We don't do relists because we have confidence in the data store, but earlier in the project we were less confident.

Now, if you are using an aggregated API not backed by etcd, you may need such behavior, and it may be something we can make easier to set on shared informers.

Maxim Ivanov

Dec 9, 2018, 10:22:59 AM
to K8s API Machinery SIG
There is quite a big chunk of logic in shared informers around DeletedFinalStateUnknown; is it an artifact from the times when Watch events were not considered reliable, and can it now be removed?

lig...@google.com

Dec 9, 2018, 10:32:13 AM
to Maxim Ivanov, K8s API Machinery SIG
No: when a watch connection gets dropped, and an actual relist is required, and a previously present object is no longer present in the full relist, DeletedFinalStateUnknown is still applicable.
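
The handling being referred to typically looks something like the sketch below in a DeleteFunc; handleServiceDelete is a hypothetical cleanup routine.

package controller

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// handleServiceDelete is a hypothetical cleanup routine.
func handleServiceDelete(svc *corev1.Service) {}

// onDelete is meant to be used as the DeleteFunc of a ResourceEventHandler.
// When a deletion is only discovered via a relist, the informer hands over a
// cache.DeletedFinalStateUnknown tombstone wrapping the last known state
// instead of the object itself.
func onDelete(obj interface{}) {
    svc, ok := obj.(*corev1.Service)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            return // neither a Service nor a tombstone; ignore
        }
        svc, ok = tombstone.Obj.(*corev1.Service)
        if !ok {
            return
        }
    }
    handleServiceDelete(svc)
}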

Clayton Coleman

Dec 9, 2018, 11:00:18 AM
to lig...@google.com, Maxim Ivanov, K8s API Machinery SIG
For more color, for performance reasons the apiserver has an optional watch cache (enabled by default for many kube resources and distros, but not all) that has historically had bugs that cause data inconsistency at a higher rate than etcd (I’m not aware of any major ones in etcd beyond the HA etcd 2->3 migration issue Jordan helped track down).  That was what I was alluding to in the previous email, where we sometimes find bugs in our own caches that can impact informers.

So in short:

1. Relist should not be necessary against kube APIs.
2. If you have aggregated APIs backed by something besides etcd and the k8s.io/apiserver code and you can't make watch reliable, relist may be necessary (although unreliable watch is probably a worse idea than telling consumers you don't support watch at all and forcing the informers to relist).
3. Even with reliable watch, if you are disconnected from an apiserver for some period of time you may not be able to watch because the server has expired that history, in which case a relist happens automatically and you will get things like DeletedFinalStateUnknown.
4. If you crash, you will not even get DeletedFinalStateUnknown, which is where DeltaFIFO and its "sync list" code come into play - if you are trying to reconcile kube state to some external store, it's up to you to reconstruct that.

Aren’t distributed systems fun?

Also, this reminds me we need to come back to our “syncing to external systems” docs and do another review - it’s been a while.
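
A sketch of the kind of reconstruction point 4 alludes to, under the assumption of a hypothetical external load-balancer client whose entries are keyed by the owning Service's namespace/name:

package controller

import (
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    listersv1 "k8s.io/client-go/listers/core/v1"
    "k8s.io/client-go/tools/cache"
)

// ExternalLB is a hypothetical client for the outside system; its entries are
// assumed to be keyed by the "namespace/name" of the Service that created them.
type ExternalLB interface {
    ListEntryKeys() ([]string, error)
    DeleteEntry(key string) error
}

// cleanupOrphans removes external entries whose owning Service no longer
// exists, covering deletions that happened while the controller was down.
// Run it after the informer caches have synced (and periodically, if desired).
func cleanupOrphans(lister listersv1.ServiceLister, lb ExternalLB) error {
    keys, err := lb.ListEntryKeys()
    if err != nil {
        return err
    }
    for _, key := range keys {
        ns, name, err := cache.SplitMetaNamespaceKey(key)
        if err != nil {
            continue // not a key we created; leave it alone
        }
        if _, err := lister.Services(ns).Get(name); apierrors.IsNotFound(err) {
            // The Kubernetes object is gone and no delete event will ever
            // arrive for it, so remove the external entry ourselves.
            if err := lb.DeleteEntry(key); err != nil {
                return err
            }
        }
    }
    return nil
}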

Daniel Smith

Dec 9, 2018, 11:18:41 AM
to Clayton Coleman, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
The purpose of the resync feature is no longer to recover from bugs; it's to let you poll an external system.

Clayton Coleman

Dec 9, 2018, 11:27:36 AM
to Daniel Smith, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
To recover from API machinery bugs, yes.

Its current primary purpose is to help you recover from your own bugs, which are the vast majority of new controller issues.

Maxim Ivanov

Dec 9, 2018, 11:46:12 AM
to K8s API Machinery SIG
Thanks, I've opened https://github.com/kubernetes/kubernetes/pull/71893 with a small correction to the docs about the Reflector's resync param; hopefully it is accurate.

Daniel Smith

Dec 9, 2018, 2:24:53 PM
to Clayton Coleman, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
I'm not really convinced that's a good backup plan -- if your controller gets it wrong once, why won't it continue getting it wrong?

I think exhaustive testing is the way to go there. At the very least, make a metric that increments when you perform an action on a resync, so you can know if there's a bug or not...
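
Such a metric might look like this with the Prometheus Go client; the metric name is made up, and detecting "this action happened during a resync" is left to the controller (for example via the ResourceVersion comparison mentioned earlier in the thread):

package controller

import (
    "github.com/prometheus/client_golang/prometheus"
)

// resyncActions counts actions that were only taken because a resync
// re-delivered an object; with correct watch/requeue handling it should
// stay near zero, so a rising counter points at a bug.
var resyncActions = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "controller_resync_actions_total",
    Help: "Actions performed only because of a resync.",
})

func init() {
    prometheus.MustRegister(resyncActions)
}

// recordResyncAction is called whenever a sync triggered by a resync (rather
// than a real change) actually had to modify something.
func recordResyncAction() {
    resyncActions.Inc()
}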

Clayton Coleman

Dec 9, 2018, 7:23:12 PM
to Daniel Smith, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
I dunno, all the bugs I ever see in controllers are race conditions, and sync eventually fixes them.

mspr...@us.ibm.com

Dec 9, 2018, 9:21:53 PM
to K8s API Machinery SIG
But a resync will not find something that needs to be deleted downstream because the API object is already gone.

My bottom line is: write your controller to use requeuing correctly and synchronize state with other stuff correctly, and forget about resync.

Regards,
Mike

Max Ivanov

Dec 10, 2018, 3:27:48 AM