What's the right resync period value for informers?


Timo Reimann

Sep 7, 2017, 5:24:32 AM9/7/17
to K8s API Machinery SIG
Hello,

I have a question regarding building controllers with the help of the informer framework (using Go, if that matters). Specifically, every informer must be given a resync period that defines the frequency at which all watched objects get a chance to be revisited by means of a full list. To my understanding, this is to guard against transient errors on the watch part of the ListWatch interface (though I'm not perfectly clear whether that covers client-side errors, server-side errors, or both).
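For concreteness, this is roughly how that parameter gets wired up with client-go's shared informer factory. It is only a sketch: the kubeconfig path, the 30-second period, and the choice of the Endpoints informer are illustrative, not recommendations.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig path; in-cluster config works just as well.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Every informer created from this factory shares the same resync period.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)

	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a resync this fires for every cached object, even though
			// nothing changed on the API server.
			ep := newObj.(*corev1.Endpoints)
			fmt.Printf("saw endpoints %s/%s\n", ep.Namespace, ep.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Run for a couple of resync periods, then shut everything down.
	time.Sleep(90 * time.Second)
	close(stop)
}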

Recommendations found in the wild as to the right resync period seem to gravitate towards "keep it rather high to avoid burdening the API server too much", and I often see values on the order of multiple minutes or even hours. However, this advice seems to render the resync period unsuitable as a means to compensate for watch errors for somewhat latency-critical controllers.

To add a real-world example: I am working on the Traefik Ingress controller which watches over a user-provided list of namespaces. For each namespace, Traefik retrieves Ingress, Service, and Endpoints objects to build up the routing table. Missing one of the objects means we cannot create a proper forwarding rule for the affected application(s) until the situation is remedied. So my tendency is to pick a rather low resync period (preferably no more than a minute) to make sure the routing gets fixed rather quickly, but that seems contrary to the common advice of what I should be doing.

I guess my questions boil down to the following:

1. Am I missing the point of the resync period somehow?
2. Are there any guidelines on how to choose the right value?
3. Is there maybe a way to measure the performance impact it has on the API server?

Thanks in advance for any advice, suggestions, and help in general.

Timo

David Eads

Sep 7, 2017, 8:08:48 AM9/7/17
to Timo Reimann, K8s API Machinery SIG
The resync period does more to compensate for problems in the controller code than to compensate for missed watch events.  We have had very few "I missed an event" bugs (I can think of one in recent memory), but we have had many controllers which use the resync as a big "try again" button.

A resync is different than a relist.  The resync plays back all the events held in the informer cache.  A relist hits the API server to re-get all the data.

Since we introduced the rate limited work queue a few releases ago, the need to wait for a resync to retry has largely disappeared since an error during processing gets requeued on an incrementing delay.

Think of the resync as insurance.  You probably want it set to more than a few minutes and less than a few hours.  If you're using requeuing correctly and avoiding panics, you aren't likely to benefit from rescanning all the data very often.
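A minimal sketch of the requeue pattern David describes, using client-go's workqueue package; syncHandler and the example key are stand-ins for a controller's own reconcile logic:

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// syncHandler stands in for the controller's real reconcile logic.
func syncHandler(key string) error {
	fmt.Println("reconciling", key)
	return nil
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Informer event handlers would normally call queue.Add(key); one key
	// is enqueued here just so the loop has something to chew on.
	queue.Add("default/my-ingress")

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		func() {
			defer queue.Done(key)
			if err := syncHandler(key.(string)); err != nil {
				// Requeue with an increasing, rate-limited delay instead of
				// waiting for the next resync to try again.
				queue.AddRateLimited(key)
				return
			}
			// Success: reset this key's backoff counter.
			queue.Forget(key)
		}()
	}
}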


Daniel Smith

Sep 7, 2017, 12:44:33 PM9/7/17
to David Eads, Timo Reimann, K8s API Machinery SIG
When
* you are writing a controller which interacts both with Kubernetes and some other outside system (like, say, some cloud's load balancer API), and
* the outside system doesn't offer a watchable interface, so
* you intend to poll the outside system (to make sure it stays correct / repair it),
then it makes sense to use resync to trigger the polling behavior, and the resync period can be chosen based on how often you want to reconcile.

...you just can't forget to also have a system that periodically finds all the objects you've created in the past but that now need to be deleted (i.e., don't leak).
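A rough sketch of that pattern, assuming a Services informer; pollAndRepairExternalLB and applyToExternalLB are hypothetical helpers, and comparing ResourceVersions is a common way to tell a resync-driven update apart from a real change:

// addExternalPollingHandler registers an UpdateFunc that uses resyncs as the
// polling trigger. pollAndRepairExternalLB and applyToExternalLB are hypothetical.
func addExternalPollingHandler(informer cache.SharedIndexInformer) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldSvc := oldObj.(*corev1.Service)
			newSvc := newObj.(*corev1.Service)

			// During a resync the informer replays its cache, so old and new
			// carry the same ResourceVersion: nothing changed in Kubernetes,
			// but it is time to re-check the outside system.
			if oldSvc.ResourceVersion == newSvc.ResourceVersion {
				pollAndRepairExternalLB(newSvc)
				return
			}

			// A genuine change on the API server: push it out right away.
			applyToExternalLB(newSvc)
		},
	})
}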

Timo Reimann

Sep 8, 2017, 11:32:35 AM9/8/17
to K8s API Machinery SIG
Thanks for the feedback, that was really helpful. In particular, I wasn't aware that resyncs and relists were two separate things -- great to know.

One follow-up question: If I'm seeing correctly, the Kubernetes scheduler runs with its resync period disabled. Under which circumstances would you not want to have "insurance" (as David put it)?



Daniel Smith

Sep 8, 2017, 12:33:07 PM9/8/17
to Timo Reimann, K8s API Machinery SIG
I believe we have quashed the bugs and insurance is not necessary or desirable any more.

I think there is value in tests to make sure the relisting code gets exercised, since networks do occasionally drop connections at inconvenient times.


David Eads

Sep 8, 2017, 12:35:06 PM9/8/17
to Daniel Smith, Timo Reimann, K8s API Machinery SIG
We've still been catching bugs in this space, but they've been mostly in the controllers.  The PV controller relying on resync comes to mind.  The insurance is largely for the controller code, not so much the list/watch (though we did find a bug in the current release there too). 


Daniel Smith

unread,
Sep 8, 2017, 12:39:38 PM9/8/17
to David Eads, Timo Reimann, K8s API Machinery SIG
Right, and I think the scheduler was pretty solid on this front last time I looked at it, which was quite a while ago. But yeah, if the question was more general-- periodic resync should guard against controller bugs (but controller authors should probably treat anything that turns up as a bug and fix it).
 


Timo Reimann

Sep 10, 2017, 4:21:53 PM9/10/17
to K8s API Machinery SIG
On Friday, September 8, 2017 at 6:33:07 PM UTC+2, lavalamp wrote:
I think there is value in tests to make sure the relisting code gets exercised, since networks do occasionally drop connections at inconvenient times.

Not exactly sure I'm understanding this point: Does it mean that I (as the controller implementor) am responsible for issuing a relist if I detect a connection issue with my client? In particular, does that hold for client-go?

 

Daniel Smith

Sep 11, 2017, 12:29:05 PM9/11/17
to Timo Reimann, K8s API Machinery SIG

If you use the standard tools, the relist will happen automatically if it is necessary; but since that is relatively rare, it's good to make sure it is exercised in tests in case it exercises a corner case in your controller code.
 

 

Maxim Ivanov

Dec 9, 2018, 8:49:24 AM12/9/18
to K8s API Machinery SIG
With standard SharedInformers, do relists happen at all? I can't find any List calls in the Reflector's code apart from the initial call when the reflector is set up. In other words, if Resync is an insurance policy for the controller's code, what is the currently recommended way to insure against missed Watch events?

Clayton Coleman

Dec 9, 2018, 9:31:31 AM12/9/18
to Maxim Ivanov, K8s API Machinery SIG
Regarding missed watch events, there should be no such thing.  If watch events are broken, all of Kubernetes is horribly broken (it would be like the database returning incorrect transactional results).  We do occasionally have bugs at higher levels that present this way, but they are typically very obvious.  We don't do relists because we have confidence in the data store; earlier in the project we were less confident.

Now, if you are using an aggregated API not backed by etcd, you may need such behavior, and it may be something we can make easier to set on shared informers.

Maxim Ivanov

Dec 9, 2018, 10:22:59 AM12/9/18
to K8s API Machinery SIG
There is quite a big chunk of logic in shared informers around DeletedFinalStateUnknown. Is it an artifact from the days when Watch events were not considered reliable, and can it now be removed?

lig...@google.com

Dec 9, 2018, 10:32:13 AM12/9/18
to Maxim Ivanov, K8s API Machinery SIG
No. When a watch connection gets dropped, an actual relist is required, and a previously present object is no longer present in the full relist, DeletedFinalStateUnknown is still applicable.
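For illustration, a delete handler that copes with the tombstone case might look like this sketch (cleanUpExternalState is a hypothetical helper; the function would be registered as the DeleteFunc of a cache.ResourceEventHandlerFuncs):

// onServiceDelete handles both a plain deleted object and the tombstone case.
func onServiceDelete(obj interface{}) {
	svc, ok := obj.(*corev1.Service)
	if !ok {
		// The delete was only discovered during a relist (e.g. after a dropped
		// watch), so all we get is a tombstone wrapping the last known state.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			return
		}
		svc, ok = tombstone.Obj.(*corev1.Service)
		if !ok {
			return
		}
	}
	cleanUpExternalState(svc)
}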

Clayton Coleman

Dec 9, 2018, 11:00:18 AM12/9/18
to lig...@google.com, Maxim Ivanov, K8s API Machinery SIG
For more color, for performance reasons the apiserver has an optional watch cache (enabled by default for many kube resources and distros, but not all) that has historically had bugs that cause data inconsistency at a higher rate than etcd (I’m not aware of any major ones in etcd beyond the HA etcd 2->3 migration issue Jordan helped track down).  That was what I was alluding to in the previous email, where we sometimes find bugs in our own caches that can impact informers.

So in short:

1. Re-listing should not be necessary against kube APIs.
2. If you have aggregated APIs backed by something besides etcd and the k8s.io/apiserver code and you can't make watch reliable, relisting may be necessary (although an unreliable watch is probably a worse idea than telling consumers you don't support watch at all and forcing the informers to relist).
3. Even with reliable watch, if you are disconnected from an apiserver for some period of time you may not be able to resume the watch because the server has expired that history, in which case a relist happens automatically and you will get things like DeletedFinalStateUnknown.
4. If you crash, you will not even get DeletedFinalStateUnknown, which is where DeltaFIFO and its "sync list" code come into play - if you are trying to reconcile kube state to some external store, it's up to you to reconstruct that.

Aren’t distributed systems fun?

Also, this reminds me we need to come back to our “syncing to external systems” docs and do another review - it’s been a while.

Daniel Smith

Dec 9, 2018, 11:18:41 AM12/9/18
to Clayton Coleman, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
The purpose of the resync feature is no longer to recover from bugs, it's to let you poll an external system.


Clayton Coleman

Dec 9, 2018, 11:27:36 AM12/9/18
to Daniel Smith, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
To recover from API machinery bugs, yes.

Its current primary purpose is to help you recover from your own bugs, which are the vast majority of new controller issues.

Maxim Ivanov

Dec 9, 2018, 11:46:12 AM12/9/18
to K8s API Machinery SIG
Thanks, I've opened https://github.com/kubernetes/kubernetes/pull/71893 with a small correction to the docs on the Reflector's resync param; hopefully it is accurate.

Daniel Smith

Dec 9, 2018, 2:24:53 PM12/9/18
to Clayton Coleman, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
I'm not really convinced that's a good backup plan--if your controller gets it wrong once, why won't it continue getting it wrong?

I think exhaustive testing is the way to go there. At the very least, make a metric that increments when you perform an action on a resync, so you can know whether there's a bug or not...
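A sketch of such a metric using the Prometheus Go client (github.com/prometheus/client_golang/prometheus); the metric name and the triggeredByResync flag are made up for illustration:

// resyncActions counts actions that were only taken because a resync re-delivered
// an object; a steadily increasing value hints at a bug being papered over.
var resyncActions = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "controller_resync_triggered_actions_total",
	Help: "Actions performed only because a periodic resync re-delivered the object.",
})

func init() {
	prometheus.MustRegister(resyncActions)
}

// Inside the sync handler, after deciding that an action is actually needed:
//
//	if triggeredByResync {
//		resyncActions.Inc()
//	}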

Clayton Coleman

Dec 9, 2018, 7:23:12 PM12/9/18
to Daniel Smith, Jordan Liggitt, Maxim Ivanov, K8s API Machinery SIG
I dunno, all the bugs I ever see in controllers are race conditions, and sync eventually fixes them.

mspr...@us.ibm.com

Dec 9, 2018, 9:21:53 PM12/9/18
to K8s API Machinery SIG
But resync will not find something that needs to be deleted downstream because the API object is already gone.

My bottom line is: write your controller to use requeuing correctly and synchronize state with other stuff correctly, and forget about resync.

Regards,
Mike

Max Ivanov

Dec 10, 2018, 3:27:48 AM12/10/18
to mspr...@us.ibm.com, K8s API Machinery SIG
Let's say it is an ingress controller and it opens watchers for Ingress and Secret resources. It is totally possible that an event on the Ingress watcher is delivered/processed first, followed by an event on the Secret watcher. In that case, processing of the Ingress event cannot find the referenced Secret.

Instead of keeping a "known, but errored during processing" state for the Ingress, so that it can be picked up again on a subsequent Secret event and finish processing, the controller author might simply rely on the resync period, which carpet-bombs all sorts of problems like that.


Mike Spreitzer

Dec 10, 2018, 12:39:13 PM12/10/18
to Max Ivanov, K8s API Machinery SIG
You definitely do _not_ want to use resync to solve that problem.  You want a solution with low latency and low cost.

Perhaps one approach would be for the processing of a new Secret to include enqueuing references to all the Ingress items that reference that Secret.
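A sketch of that approach for the Ingress/Secret case; ingressLister (a networking/v1 Ingress lister), queue, and the usual imports are assumed, and the lookup walks spec.tls, which is where Ingresses reference TLS Secrets:

// onSecretChange enqueues every Ingress in the Secret's namespace that references
// it via spec.tls. ingressLister and queue are assumed to exist in the controller.
func onSecretChange(obj interface{}) {
	secret, ok := obj.(*corev1.Secret)
	if !ok {
		return
	}
	ingresses, err := ingressLister.Ingresses(secret.Namespace).List(labels.Everything())
	if err != nil {
		return
	}
	for _, ing := range ingresses {
		for _, tls := range ing.Spec.TLS {
			if tls.SecretName == secret.Name {
				if key, err := cache.MetaNamespaceKeyFunc(ing); err == nil {
					queue.Add(key)
				}
				break
			}
		}
	}
}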

Regards,
Mike



Mike Spreitzer

Dec 10, 2018, 12:43:11 PM12/10/18
to Max Ivanov, K8s API Machinery SIG
Perhaps I overlooked a more basic point here.  A controller should be level based, not edge based.  So, yes, you _do_ need the ability to represent the state of an Ingress being known but its Secret being unknown.  That is a state that can be imposed on the controller; you do not have a choice in this matter.

Regards,
Mike



Max Ivanov

Dec 10, 2018, 2:51:22 PM12/10/18
to Mike Spreitzer, K8s API Machinery SIG
Not every controller requires minimal response times.  Often a simple "retry everything" is enough and just works.

It is akin to the problem of waiting for dependencies. Some teams write InitContainers with finesse, where all dependencies are carefully evaluated and waited on until their conditions are satisfied for the given pod, proper progress reports are written to stderr, and the joy of building a masterpiece settles on the developers. As an observation, these teams enjoy working with pubsub patterns and look down on any busy loops or polling.

Other teams just let their pod crash anytime things go wrong, letting the orchestrator keep restarting the whole swarm until it eventually settles on some stable state, where leaves in the dependency tree come up first, followed by all the levels above, one by one.

The former usually keep finding bugs and fine-tuning that intricate dance; the latter don't even know such a problem exists.

Mike Spreitzer

Dec 10, 2018, 3:13:18 PM12/10/18
to Max Ivanov, K8s API Machinery SIG
Another approach would be to requeue an Ingress whose Secret is unknown.  In the typical case where the root problem is just a bit of notification asynchrony, this will quickly and cheaply handle the problem; in other cases there is periodic reconsideration of just the blighted Ingress objects rather than all of them.
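A sketch of the requeue approach; the listers, apierrors (k8s.io/apimachinery/pkg/api/errors), and buildRoutes are assumed. The point is simply that returning an error re-enqueues only the affected Ingress, with backoff:

// syncIngress is the work-queue handler; ingressLister, secretLister, and
// buildRoutes are assumed. Returning an error makes the caller requeue the key.
func syncIngress(key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	ing, err := ingressLister.Ingresses(namespace).Get(name)
	if apierrors.IsNotFound(err) {
		return nil // the Ingress itself is gone; nothing to do
	}
	if err != nil {
		return err
	}
	for _, tls := range ing.Spec.TLS {
		if _, err := secretLister.Secrets(namespace).Get(tls.SecretName); err != nil {
			// Most likely plain notification asynchrony: the Secret's event has
			// not arrived yet. Returning the error requeues just this Ingress.
			return err
		}
	}
	return buildRoutes(ing)
}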

Regards,
Mike

Clayton Coleman

Dec 10, 2018, 11:50:03 PM12/10/18
to Max Ivanov, Mike Spreitzer, K8s API Machinery SIG



Every successful controller is proven successful at the moment someone says “oh wow, sometimes it takes 5 minutes for X to happen, we should optimize that”.  In my experience in kube, that’s usually after it has been in production for several months.

Being edge driven is a premature optimization :). 

Not having to notice all sorts of weird edge cases until you need to is a *feature*.  :)

mspr...@us.ibm.com

Mar 28, 2019, 10:15:51 PM3/28/19
to K8s API Machinery SIG
I am not sure whether we are agreeing or disagreeing here.  If Max's controller does a requeue whenever it runs into trouble, this is a simple pattern that is easy to get right and will solve his problem with lower latency and cost than using resync.


Moushumi Das

Jul 20, 2019, 8:26:22 PM7/20/19
to K8s API Machinery SIG
Clayton had mentioned earlier in the thread
"Also, this reminds me we need to come back to our “syncing to external systems” docs and do another review - it’s been a while."
Is there such a doc?

andy....@gmail.com

Sep 1, 2021, 11:58:12 PM9/1/21
to K8s API Machinery SIG
Yes, I agree with these comments. That is what I am facing now.

On Friday, September 8, 2017 at 12:44:33 AM UTC+8, lavalamp wrote:
When 
* you are writing a controller which interacts both with Kubernetes and some other outside system (like, say, some cloud's load balancer API), and 
* the outside system doesn't offer a watchable interface, so
* you intend to poll the outside system (to make sure it stays correct / repair it), then
It makes sense to use resync to trigger the polling behavior, and the resync period can be chosen based on how often you want to reconcile.

...you just can't forget to also have a system that periodically finds all the objects you've created in the past but need to be deleted (i.e., don't leak).
