Operator Architectures Best Practices


Lars Francke

Jan 15, 2021, 6:31:31 AM1/15/21
to operator-...@googlegroups.com
Hi everyone,

we have started writing operators and, after looking at a whole bunch of blog posts and existing operators, we came across multiple different architecture patterns.

Most operators are similar in that they revolve around a reconciliation method (sometimes named differently, e.g. sync) which is triggered by informers/watchers/etc. every time some object or one of its dependencies changes in any way.

But the way these reconcile methods are implemented differs.

1) There are operators like the Red Hat Prometheus Operator or the CNCF Strimzi Operator which have a "long-running" reconcile function.

This means they start with an event and only return when all steps needed to reach the target state are either done or there is an error.

If, for example, a Pod needs to be created, these operators often wait until the API server has registered it and the Pod has been picked up and is running before continuing with the next action.
This means that changes that happen to the object while the reconciliation runs cannot interrupt the running reconciliation.


2) Then there are other operators that work on a more fine-grained basis. They move step-by-step in the target direction.

This means they’ll, for example, create a Pod and then end the reconciliation, requeueing the event for some time later, either after a specified delay (e.g. 10 seconds) or by relying on the action they took (e.g. the Pod creation) to trigger a new reconciliation automatically because they watch their owned resources. The next reconciliation then starts at the beginning and looks at everything “with fresh eyes”.
This has the benefit that it can react to changes while it is working towards a target state.

Something I haven’t seen yet but I’m sure it exists:
3) A hybrid of the two approaches where some processes are not interruptible and are “waited for” during the reconciliation process and others are run one-at-a-time.


In addition there are different ways of implementing operators:
* Some have one big “driver” method that calls into other methods and passes the required state via function arguments,

* others have an abstraction of a list of closures/methods (depending on programming languages) that all adhere to the same contract but are otherwise more or less independent of each other.

* The last variant that we found is one based on state machines, but those don’t seem to fit operators perfectly (though this is mostly a gut feeling).

As we're just getting started, we're interested in hearing what has worked for you and what hasn't. Are there drawbacks or benefits to going one way or the other? Any best practices to share?

Basically any kind of feedback is welcome.

Thanks and stay healthy,
Lars

Camila Macedo

Jan 15, 2021, 10:27:23 AM1/15/21
to Lars Francke, Operator Framework
Hi @Lars,

I will try to share some docs and thoughts that I hope will help you with this.

Regarding best practices for controllers/reconciliation, I'd say the main one is to develop idempotent solutions. The idea is that when you reconcile a resource of your API (CRD/CR) you perform a collection of operations until it reaches the desired state on the cluster. I'd suggest checking the doc Groups and Versions and Kinds, oh my! [1] to better understand what APIs mean and to be able to design better solutions. PS: I am linking the master branch because it was updated recently and its changes are not published yet at https://book.kubebuilder.io/. Other good Kubernetes docs that might help you understand this better can be found here [2].
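To make "idempotent" concrete, here is a minimal Go/controller-runtime sketch of a single ensure-style step; the AppReconciler type and the Deployment argument are placeholders I made up, only the create-if-missing pattern matters:

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// AppReconciler only needs a client for this sketch.
type AppReconciler struct {
	client.Client
}

// ensureDeployment is idempotent: if the Deployment already exists it does nothing,
// otherwise it creates the desired one. Running it again later changes nothing.
func (r *AppReconciler) ensureDeployment(ctx context.Context, desired *appsv1.Deployment) error {
	found := &appsv1.Deployment{}
	err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, found)
	if apierrors.IsNotFound(err) {
		return r.Create(ctx, desired)
	}
	return err // nil if it already exists, otherwise a real error to be retried
}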

Regarding project layout, you will find many projects doing different things. To give a little context, note that the layout scaffolded by Kubebuilder and the SDK has changed over the last years. You will also find projects and maintainers who decided not to follow the standards adopted by these tools, whose goal is to help you develop your projects, and there are other tools which encourage different approaches too. Note that the tools scaffold the files; however, controller-runtime [3] is the Go library for building Kubernetes controllers.

Both highlighted tools build on controller-runtime. However, I'd like to recommend checking out the SDK and using its latest version to create your projects. The SDK (versions >= 1.0) and Kubebuilder provide the same standard layout. On top of that, the SDK provides helpers for actions such as integrating your project with OLM [11], publishing it to OperatorHub.io [5] and/or the OCP catalogue, and a scorecard to test your projects, and it supports other languages too. You will find more about that in the SDK docs (https://sdk.operatorframework.io/).

Also, for better context on the advantages of using the SDK and its differences from Kubebuilder, I'd recommend checking the comment [6] and the blog post Operator SDK Reaches v1.0 [7].

Note that after you use the tools to create your projects and APIs, you can customise them as you please. However, I would suggest trying not to deviate from the proposed layout. You can add anything you wish on top, but if you move around what is scaffolded by default, you might find yourself in a more challenging position when trying to keep your project maintainable: you may no longer be able to follow a straightforward process to resolve tech debt and upgrade your projects as described in the docs, and you may no longer be able to take advantage of some options and helpers provided by the tools. So, before you start customising, I'd advise understanding the proposed layout structure. See the doc What’s in a basic project? [8].

I'd like to encourage you to follow the Golang Based Operator Tutorial [9] before you start developing your operators. You will not spend more than an hour on it, and it will give you a better idea of how things work. PS: we have a PR open to improve this tutorial, so you might find that easier to follow as well. See the preview of the changes here [10].

Finally, I'd like to share that, IMHO, the best approach is to follow DDD principles. To illustrate, let’s think about the classic scenario where the end goal is to have an operator which manages an application and its database. One resource could represent the App, and another one could represent the DB. By having one CRD to specify the App and another one for the DB, we avoid hurting concepts such as encapsulation, the single responsibility principle, and cohesion. Damaging these concepts could cause unexpected side effects, such as difficulty extending, reusing, or maintaining the code, to mention only a few. Additionally, I'd not develop a single controller such as an install.go that reconciles everything. Instead, I'd implement one controller to synchronize the App CRs and another for the DB CRs. This is only an example to illustrate my thoughts; note that OLM allows you to declare a dependency relation between operators, which might help you address your needs better. More info can be found in the OLM doc Dependency Resolution.
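As a rough sketch of that "one controller per CR" layout, the main.go wiring could look like this; AppReconciler and DBReconciler are hypothetical placeholders for the two scaffolded controllers:

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// One controller per CRD: each reconciler synchronizes only its own Kind.
	if err := (&AppReconciler{Client: mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		panic(err)
	}
	if err := (&DBReconciler{Client: mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}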

Please, feel free to raise issues in the repos with your questions, problems and suggestions. 
Your collaboration with the projects is very welcome too. 


Cheers, 

CAMILA MACEDO

SR. SOFTWARE ENGINEER 

RED HAT Operator framework

Red Hat UK

IM: cmacedo






Lars Francke

Jan 18, 2021, 8:16:19 AM1/18/21
to Camila Macedo, Operator Framework
Hi Camila,

thank you for the thorough reply.
There were some resources in there that I hadn't seen before and I'll look through them this week.
Especially around OLM, which I hadn't heard of before.

I didn't mention any programming languages or frameworks on purpose because I believe they are not relevant to my question as I'm asking about a more general architecture pattern.
I'm basically wondering how people implement their `reconcile` method, and we see multiple styles of doing so, as outlined in my initial post: either one step at a time with requeueing, or one big reconciliation end-to-end.

Cheers,
Lars



Camila Macedo

Jan 19, 2021, 6:50:42 AM1/19/21
to Lars Francke, Operator Framework
Hi @Lars, 

Some comments follow inline; I hope they help.

> I didn't mention any programming languages or frameworks on purpose because I believe they are not relevant to my question as I'm asking about a more general architecture pattern.

Following the operator pattern, you will create controllers (controller-runtime) with a reconcile function responsible for synchronizing the resources until they reach the desired state on the cluster. Not necessarily, but usually, the primary resources managed by these controllers come from your own APIs, which extend the Kubernetes API with CustomResourceDefinitions. The docs shared previously explain why to create APIs at all. Note that, as shared, the golden rule is to develop idempotent solutions. PS: IMHO, since reconcile() performs resource reconciliations, understanding the API concepts and following those principles can also result in more maintainable controller/reconciliation implementations. (See the last paragraph of my previous email.)

> I'm basically wondering how people implement their `reconcile` method, and we see multiple styles of doing so, as outlined in my initial post: either one step at a time with requeueing, or one big reconciliation end-to-end.

This shows that you are looking to understand how controllers/reconciliations work.
Then, I will try to summarise it here with some examples.

Understanding reconcile()

It works like a loop, and it does not stop until all the conditions in its implementation are met, which means you will need to add to it all the operations required to accomplish the desired state on the cluster. Now, check the pseudo-code here to understand it better, and then note:
  • To stop the reconcile (don't requeue), use: return ctrl.Result{}, nil
  • To requeue the reconcile due to an error, use: return ctrl.Result{}, err
  • To requeue for any reason other than an error, use: return ctrl.Result{Requeue: true}, nil
  • To requeue after X time, use for example: return ctrl.Result{RequeueAfter: nextRun.Sub(r.Now())}, nil
The return value is up to your requirements, solution and operations. E.g. usually for failures you will use return ctrl.Result{}, err, unless that error means you should wait for X time before trying to reconcile again (return ctrl.Result{RequeueAfter: value}) or stop the reconciliation (return ctrl.Result{}, nil). To reload your variables you might use return ctrl.Result{Requeue: true}, nil, in the same way that it can be used after any condition/operation from which you would like to restart the "loop" from the beginning. For more details, check https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile.
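Put together, a Reconcile skeleton using these return values might look like this; MyReconciler, MyKind and needsAnotherPass are placeholders rather than real API names:

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &examplev1.MyKind{} // placeholder CRD type
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil // object is gone: stop, don't requeue
		}
		return ctrl.Result{}, err // transient error: requeue with backoff
	}
	if needsAnotherPass(obj) { // placeholder condition
		return ctrl.Result{Requeue: true}, nil // restart the "loop" from the beginning
	}
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil // desired state reached: re-check later
}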

Now, let's check an example. It might answer your question. 

Let's examine the reconcile() of the Memcached sample suggested in the tutorial shared with you previously (see here). Find in it the return `ctrl.Result{Requeue: true}, nil` used to retrigger the reconcile after creating the Deployment. This happens because we want to check the number of instances, `if *found.Spec.Replicas != size {` (see here).

To do that, we need the latest state of the resource on the cluster, so we requeue to get its value from the cluster. The same requirement could also be solved by re-fetching (client.Get) the resource before the conditional instead of using `ctrl.Result{Requeue: true}, nil`. Both approaches are valid.
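A sketch of the re-fetch variant, loosely following the Memcached sample (the type and helper names here are my own assumptions, not the exact tutorial code):

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type MemcachedReconciler struct {
	client.Client
}

// syncReplicas re-reads the Deployment from the cluster right before the check,
// instead of returning Requeue: true just to reload its value.
func (r *MemcachedReconciler) syncReplicas(ctx context.Context, key types.NamespacedName, size int32) (ctrl.Result, error) {
	found := &appsv1.Deployment{}
	if err := r.Get(ctx, key, found); err != nil {
		return ctrl.Result{}, err
	}
	if found.Spec.Replicas == nil || *found.Spec.Replicas != size {
		found.Spec.Replicas = &size
		if err := r.Update(ctx, found); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}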

Is it always a good idea to requeue in order to reload the values?

Beware that if you are doing it in order to update the resource, you might face issues in the logs such as:
> the object has been modified; please apply your changes to the latest version and try again
Usually, this happens when you try to update a resource that changed between the time you GET its values from the cluster and the moment you UPDATE it. For that reason, when the goal is to update a resource, I'd suggest re-fetching it right before the update instead of relying on the state held in your local variables from the requeue.
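client-go ships a helper for exactly this conflict case; a minimal sketch of re-fetching inside a conflict-retry loop (the replica update is only an illustrative example, not specific to any project):

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateReplicas re-fetches the object on every attempt, so a stale local copy
// never makes the "object has been modified" error permanent.
func updateReplicas(ctx context.Context, c client.Client, key types.NamespacedName, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest := &appsv1.Deployment{}
		if err := c.Get(ctx, key, latest); err != nil {
			return err
		}
		latest.Spec.Replicas = &replicas
		return c.Update(ctx, latest) // on a conflict, RetryOnConflict calls this func again
	})
}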

Note that the reconciliation is also called by: 
I hope I could cover all the points raised and provide the required references.
However, please do not hesitate to raise your questions here and/or in the repositories. 

Lars Francke

Jan 25, 2021, 6:55:30 AM1/25/21
to Camila Macedo, Operator Framework
Hi Camila,

thanks again for the answer, sorry it took me a bit longer to get to it.

Reading your reply I'm afraid I didn't make myself very clear. I'll try again and hope it helps.
We're not even using Go as our programming language of choice, we're currently developing in Rust but that said I don't believe my question has anything to do with the language or its specifics.
I looked at around 20 open source operators in three languages (Java, Go and Rust). All of them have the same boilerplate but in the end you need to implement a `reconcile` function in all of these languages.

My only question is whether there are any known best practices for implementing this method, because I found basically two distinct styles of writing operators.
1) Some operators have "long running" reconcile functions where each reconcile does multiple actions, and if one depends on the outcome of a previous one, it waits within the reconcile function for the result instead of returning.
Pseudo-code:
func reconcile(resource) {
  waitFor(maybeDoAction1(resource));
  waitFor(maybeDoAction2(resource));
}

2) Other operators have a different style. They do one action and then requeue. In the next iteration they'll see that this action doesn't need to be done anymore so they can proceed one step further.
Pseudo-code:
func reconcile(resource) {
  if (needToDoAction1(resource)) {
    doAction1(resource);
    return requeue;
  }

  if (needToDoAction2(resource)) {
    doAction2(resource);
    return requeue;
  }
}

The pseudo-code is just that; it obviously lacks a lot of detail, but it should capture the general pattern.

If you have any opinions on those two styles (or a mixture) I'd love to hear them.

Cheers,
Lars



Ken Sipe

Jan 25, 2021, 7:39:06 AM1/25/21
to Lars Francke, Camila Macedo, Operator Framework
Hey Lars!  Welcome to the community!

Some thoughts here… 

1. There are likely many code examples out there… with varying degrees of maturity (which, from the sounds of it, you recognize and which is in part a driver of your questions)
2. There has been additional maturing in the Go operator SDK AND in the Kubernetes community
3. I would guess that on this mailing list there is going to be a bias influenced by Go and controller-runtime. This is reflected in the answers and references Camila provided (thanks Camila… solid stuff!).
4. Controller-runtime is a distillation of ideas, best practices and an evolution of thought in the controller / operator space in Kubernetes. There is good reason to follow its patterns, as I don’t know of another framework or language that is as leading edge (it could be a lack of awareness on my part, but it is also hard to imagine, as the team working on controller-runtime works closely with api-machinery etc. in the Kubernetes core).

With all of that as a backdrop… here are some additional thoughts, abstracted from this framework/lib, that hopefully align with your question…

1. The core of a controller is a loop over a) what is expected, b) what is in the cluster and c) invoking changes to align what is with what is declared
2. It really doesn’t matter what style of loop you choose…   What is important is the following:
3. Controllers in general should expect that they are working collaboratively in the cluster. For instance, the ReplicaSet controller creates pods by calling the api-server (it doesn’t actually create the pod directly… it POSTs to the api-server against the Pod API). So controllers should in general expect that changes on a given resource could have happened from another agent (human or controller).
4. Based on this, you need to confirm you are making decisions off the current resourceVersion and don’t get in a chaotic loop against another agent 
5. It is best not to hammer the api-server, so the controller-runtime lib has some caching to reduce this… which has its own set of rules. You have to update the cache or be aware of when it is invalid, etc. Also, you don’t want to “update” an object straight out of the cache as it affects other threads.
6. You mentioned “time” based loops and I would definitely avoid that… time is great for timeouts… but terrible for waiting for conditions… events! 
7. A major driver of practice is Kubernetes itself… since 1.18 server-side apply (SSA) has been implemented. This adds additional concerns to be aware of… with SSA, the metadata of a resource tracks ownership of fields within the resource. You want to be aware if your controller is modifying a field that is owned by another controller and learn how to collaborate in the cluster.
8. Currently off a branch in client-go are changes to support SSA better (it would be worth a review to get ideas for other languages).
9. Along with a cache, you want to focus on “watches” vs repeated GET requests against the api-server (see the sketch after this list).
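For points 5 and 9, this is roughly what "watch instead of poll" looks like with controller-runtime; the examplev1.MyKind API is a placeholder:

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// SetupWithManager wires the reconciler to watches instead of polling: events for
// MyKind and for Deployments owned by a MyKind flow through the manager's shared
// (cached) informers and are enqueued as reconcile requests.
func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.MyKind{}).   // primary resource (placeholder API)
		Owns(&appsv1.Deployment{}). // secondary resources this controller creates
		Complete(r)
}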

I’m sure I’m missing something:) 

The core regarding your question is:
1. You will likely get answers that are biased towards the responders’ experiences or their favorite pattern / tool
2. I’m less attached to a pattern, but there are some core concerns to be aware of
3. Many of these concerns are captured in controller-runtime… but even with controller-runtime many concerns are still in the hands of the implementer 

Good luck!!
Ken 


Ken Sipe

Jan 25, 2021, 8:43:53 AM1/25/21
to Lars Francke, Camila Macedo, Operator Framework
I realized after providing an answer that I said almost nothing about the loop styles in question:) 
Which is likely an indication of my Meh:)   Here are some specific thoughts.

1. Although possible and easy to understand… It seems immature to have a loop that does it all (which is generally how we introduce the idea).  I would not be for a forever loop that queries the state of things and reacts to that.   
2. Put another way… there needs to be a “work queue”: a queue of work, which is a collection of changes in the system (see the sketch after this list). At that point… do you have one queue per object type? Or a generic queue of all the types? This is likely the first point of design, and it will drive some of the decisions
3. Another influence on the type of loop is… are you solving just your controller’s interests… or building out something generic to be used by other controllers?
4. I would be against a thread per event or change (you don’t know the depth of queue/work)
5. You could have a loop per type of interest perhaps… or a thread pool per type… but that doesn’t seem like a great generic solution
6. Then you get into… what to do with multiple changes on the same object within the same queue… do you care to track each change… or do you remove dups and work on the last change since the last loop entry.  Generally based on eventual consistency, you would favor the last change only.
7. You have to protect the loop… you can’t be deadlocked on something… how do you handle timeouts… how do you handle errors
8. You want to avoid long running functions within the loop…
9. We are dealing with the control plane… 
10. You need to be able to handle your controller crashing and restarting… the loop should make no assumptions regarding that. 
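On the work-queue point, client-go already provides a rate-limited queue that de-duplicates items by key; here is a hedged sketch of a worker loop on top of it (the reconcileKey callback stands in for your real logic):

import (
	"k8s.io/client-go/util/workqueue"
)

// runWorker drains a rate-limited work queue. The queue de-duplicates by key, so a
// burst of events for the same object collapses into one pending item and the worker
// effectively acts on "the last change since the last loop entry".
func runWorker(queue workqueue.RateLimitingInterface, reconcileKey func(string) error) {
	for {
		item, shutdown := queue.Get() // blocks until an item arrives or ShutDown() is called
		if shutdown {
			return
		}
		key := item.(string) // typically a namespace/name key added by informer event handlers
		if err := reconcileKey(key); err != nil {
			queue.AddRateLimited(key) // retry later with per-item backoff
		} else {
			queue.Forget(key) // clear the backoff counter once the key succeeds
		}
		queue.Done(item)
	}
}

Such a queue is typically created with workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()) and fed from informer event handlers; controller-runtime hides this whole loop behind Reconcile.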

Again… just 1 man’s thoughts… 

Good luck,
Ken 

Camila Macedo

Jan 25, 2021, 9:48:53 AM1/25/21
to Ken Sipe, Lars Francke, Operator Framework


HI @Ken, 

Thank you very much for the great help here and for sharing your thoughts as well.

Hi @Lars, 

I will try to answer your specific questions based on my personal POV and highlighting some previous comments to make it easier.


My comments follow inline; I hope they help.

> 1) Some Operators have "long running" reconcile functions 

From my first email:

Finally, I'd like to share that, IMHO, the best approach is to follow DDD principles. To illustrate, let’s think about the classic scenario where the end goal is to have an operator which manages an application and its database. One resource could represent the App, and another one could represent the DB. By having one CRD to specify the App and another one for the DB, we avoid hurting concepts such as encapsulation, the single responsibility principle, and cohesion. Damaging these concepts could cause unexpected side effects, such as difficulty extending, reusing, or maintaining the code, to mention only a few. Additionally, I'd not develop a single controller such as an install.go that reconciles everything.

And from my second email:

Following the operator pattern, you will create controllers (controller-runtime) with a reconcile function responsible for synchronizing the resources until they reach the desired state on the cluster. Not necessarily, but usually, the primary resources managed by these controllers come from your own APIs, which extend the Kubernetes API with CustomResourceDefinitions. The docs shared previously explain why to create APIs at all. Note that, as shared, the golden rule is to develop idempotent solutions. PS: IMHO, since reconcile() performs resource reconciliations, understanding the API concepts and following those principles can also result in more maintainable controller/reconciliation implementations.

IMHO: by understanding the concepts shared and trying to follow those principles, we can usually (though not always) at least mitigate the scenario where we end up needing "big reconciliation" methods, with their probable side effects: being hard to understand and to read, and consequently likely having lower maintainability.

Regarding: 

> multiple actions, and if one depends on the outcome of a previous one, it would wait within the reconcile function for the result and not return.

VERSUS
  
> They do one action and then requeue. In the next iteration, they'll see that this action doesn't need to be done anymore so they can proceed one step further.
  • Is the solution idempotent? 
    • If yes, they are following the golden rule. 
    • If not, then, I'd say that the solution is not following its premise. 
IMHO: if I understood your question/scenario properly, and favouring simplicity, I'd personally prefer the return approach, which seems to better follow the design proposed by controller-runtime, since the goal is to ensure a final desired state on the cluster with an idempotent solution. So, why not restart the reconciliation immediately, and/or after X time, until the goal is accomplished? I mean, why not return, requeue after X time and check everything again, instead of waitForActionN? Do you have any reason to prefer that approach for your specific need? That said, there might be some edge scenarios and operations where a waitFor without a return is preferable. It still comes down to the specific solution's needs, requirements and operations.

Cheers, 

CAMILA MACEDO

SR. SOFTWARE ENGINEER 

RED HAT Operator framework

Red Hat UK

She / Her / Hers

IM: cmacedo



Evan Cordell

Jan 25, 2021, 11:24:16 AM1/25/21
to Camila Macedo, Ken Sipe, Lars Francke, Operator Framework

There's already a lot of good information in this thread, so I'll just add the ways that I tend to think about the problem:


TL;DR: it is application-specific, but the current tooling and abstractions available to you as an author mean that it’s likely simpler to do small amounts of work and exit early / requeue.



In the language of control theory, your controller is a closed-loop feedback system where your reconciler builds a model of the system before acting upon that model.


Any change to the cluster state is a potentially important change to that model. Most controllers today are written so that they need to re-construct their current model from a set of (cached) resources, and I haven’t seen any controllers written that can cancel a running reconciliation if a cluster event produces some state-significant change to that model. It will often be simpler to exit and reconstruct your model frequently to ensure you are not dealing with an outdated model.

This hints at tooling / libraries / patterns that should likely exist but don’t yet, such as tooling to keep a model up-to-date in memory (instead of keeping a set of cached objects from which the model can be constructed). But even in the absence of something like that, you might choose to rebuild your model more or less frequently for application-specific reasons. Maybe building your model is time or resource intensive, or maybe the effects of working from an outdated model are minimal. 


Some of this even bleeds into simple, low-level decisions in a controller: if an `update` call fails for a resource, do you retry with an updated version of that one resource? Or do you need to recreate a larger model with that updated resource?



The other lens I find valuable is that of process scheduling.


When using typical Kubernetes controller machinery, you generally have one process handling one resource of a particular type at a time. Spending more time in your reconciliation can lead to a bad UX, where your managed resources may appear stuck, waiting for the controller to be freed up so that it can get to them.

Requeueing frequently is more or less yielding frequently from coroutines, which can lead to more concurrency / more apparent work being done.

Hope that helps!




--
Evan Cordell

Lars Francke

Jan 26, 2021, 5:45:18 AM1/26/21
to Ken Sipe, Camila Macedo, Operator Framework
Thank you for the warm welcome and the response. That's a lot to unpack. My answers are inline.

On Mon, Jan 25, 2021 at 2:43 PM Ken Sipe <ken...@gmail.com> wrote:
I realized after providing an answer that I said almost nothing about the loop styles in question:) 
Which is likely an indication of my Meh:)   Here are some specific thoughts.

That's already helpful though.
I know how ecosystems evolve and I am just trying to find out if some best-practices have emerged over time.
From what I read and what Camila has posted the answer is "yes" but only up to a point.
 
1. Although possible and easy to understand… It seems immature to have a loop that does it all (which is generally how we introduce the idea).  I would not be for a forever loop that queries the state of things and reacts to that.   
2. Put another way… there needs to be a “work queue”: a queue of work, which is a collection of changes in the system. At that point… do you have one queue per object type? Or a generic queue of all the types? This is likely the first point of design, and it will drive some of the decisions
3. Another influence on the type of loop is… are you solving just your controller’s interests… or building out something generic to be used by other controllers?
4. I would be against a thread per event or change (you don’t know the depth of queue/work)
5. You could have a loop per type of interest perhaps… or a thread pool per type… but that doesn’t seem like a great generic solution
6. Then you get into… what to do with multiple changes on the same object within the same queue… do you care to track each change… or do you remove dups and work on the last change since the last loop entry.  Generally based on eventual consistency, you would favor the last change only.
7. You have to protect the loop… you can’t be deadlocked on something… how do you handle timeouts… how do you handle errors
8. You want to avoid long running functions within the loop…
9. We are dealing with the control plane… 
10. You need to be able to handle your controller crashing and restarting… the loop should make no assumptions regarding that. 

These are all good points, yes.
Most of what I read leads me towards a "take one step at a time" approach which should help especially with number 10.
Unfortunately, I only started taking notes towards the end but should someone else be interested here they are: <https://docs.google.com/spreadsheets/d/1Dl03_JqWLn4osf1hO0d2s-yRDmsuMXvG3kqt3WQSxTk/edit?usp=sharing>
There's a great variety out there.

* Some operators check for "event types" e.g. added, updated, deleted
* Some operators requeue others don't
* Some operators use the status object to track progress (I know this is frowned upon but I understand why it's done)
* Some operators have additional "command" objects (e.g. a FooRestart object), related issue https://github.com/kubernetes/kubernetes/issues/72637
* All operators I've seen use some kind of work queue, I haven't seen any debouncing/deduping but I have to admit that I didn't look too deeply for this
* Some use multiple threads, some don't…

This has been a very interesting exercise.

Cheers,
Lars
 

Again… just 1 man’s thoughts… 

Good luck,
Ken 

On Jan 25, 2021, at 6:39 AM, Ken Sipe <ken...@gmail.com> wrote:

Hey Lars!  Welcome to the community!

Some thoughts here… 

1. There are likely many code examples out there… with varying degrees of maturity (which, from the sounds of it, you recognize and which is in part a driver of your questions)
2. There has been additional maturing in the Go operator SDK AND in the Kubernetes community
3. I would guess that on this mailing list there is going to be a bias influenced by Go and controller-runtime. This is reflected in the answers and references Camila provided (thanks Camila… solid stuff!).
4. Controller-runtime is a distillation of ideas, best practices and an evolution of thought in the controller / operator space in Kubernetes. There is good reason to follow its patterns, as I don’t know of another framework or language that is as leading edge (it could be a lack of awareness on my part, but it is also hard to imagine, as the team working on controller-runtime works closely with api-machinery etc. in the Kubernetes core).

Looking at controller-runtime and the operator-sdk/kubebuilder has definitely helped and similar libraries exist in Rust. I opted to post my question here because this community is obviously way more active and mature and because I believe that my question is a general architecture question that you will face in any language.

Lars Francke

Jan 26, 2021, 5:47:45 AM1/26/21
to Camila Macedo, Ken Sipe, Operator Framework
Camila,

thanks again for the reply.

> IMHO: if I understood your question/scenario properly, and favouring simplicity, I'd personally prefer the return approach, which seems to better follow the design proposed by controller-runtime, since the goal is to ensure a final desired state on the cluster with an idempotent solution. So, why not restart the reconciliation immediately, and/or after X time, until the goal is accomplished? I mean, why not return, requeue after X time and check everything again, instead of waitForActionN? Do you have any reason to prefer that approach for your specific need? That said, there might be some edge scenarios and operations where a waitFor without a return is preferable. It still comes down to the specific solution's needs, requirements and operations.

Yes, I think I came to the same conclusion. It makes it much easier to reason about the current state and to react to changes if you only do one step at a time.
This does not mean we can't have a combined approach where some operations are waited for, but as Ken also mentioned, we need to deal with crashed/restarted operators anyway.

This has been very helpful,
Lars

Lars Francke

Jan 26, 2021, 6:32:16 AM1/26/21
to Evan Cordell, Camila Macedo, Ken Sipe, Operator Framework
Evan, thank you. I've commented inline.

On Mon, Jan 25, 2021 at 5:24 PM Evan Cordell <cordel...@gmail.com> wrote:

There's already a lot of good information in this thread, so I'll just add the ways that I tend to think about the problem:


TL;DR: it is application-specific, but the current tooling and abstractions available to you as an author mean that it’s likely simpler to do small amounts of work and exit early / requeue.


I agree. That's why I was a bit surprised to see so many operators using a different model and hence the reason to start this thread.

 

In the language of control theory, your controller is a closed-loop feedback system where your reconciler builds a model of the system before acting upon that model.


Any change to the cluster state is a potentially important change to that model. Most controllers today are written so that they need to re-construct their current model from a set of (cached) resources, and I haven’t seen any controllers written that can cancel a running reconciliation if a cluster event produces some state-significant change to that model. It will often be simpler to exit and reconstruct your model frequently to ensure you are not dealing with an outdated model.

This hints at tooling / libraries / patterns that should likely exist but don’t yet, such as tooling to keep a model up-to-date in memory (instead of keeping a set of cached objects from which the model can be constructed). But even in the absence of something like that, you might choose to rebuild your model more or less frequently for application-specific reasons. Maybe building your model is time or resource intensive, or maybe the effects of working from an outdated model are minimal. 


Some of this even bleeds into simple, low-level decisions in a controller: if an `update` call fails for a resource, do you retry with an updated version of that one resource? Or do you need to recreate a larger model with that updated resource?



The other lens I find valuable is that of process scheduling.


When using typical Kubernetes controller machinery, you generally have one process handling one resource of a particular type at a time. Spending more time in your reconciliation can lead to a bad UX, where your managed resources may appear stuck, waiting for the controller to be freed up so that it can get to them.

Requeueing frequently is more or less yielding frequently from coroutines, which can lead to more concurrency / more apparent work being done.


Yes, this makes sense.
Some controllers are implemented in a way that could cause trouble if they are ever interrupted. To me the "long running" reconcile seems like an unnecessary optimization as you need to deal with the unhappy case anyway.

Thank you everyone for the opinions and examples.
It has helped me a lot.

Cheers,
Lars

Camila Macedo

Jan 26, 2021, 6:41:48 AM1/26/21
to Lars Francke, Evan Cordell, Ken Sipe, Operator Framework
Hi Lars, 

I am very glad that we could help you with this subject. 

Please, feel free to raise issues in the repos with your questions, problems and suggestions. 
Your collaboration with the projects is very welcome too. 

Cheers, 

CAMILA MACEDO

SR. SOFTWARE ENGINEER 

RED HAT Operator framework

Red Hat UK

She / Her / Hers

IM: cmacedo



