Best way to put error message in custom CRD

Morgan

Aug 19, 2022, 11:00:04 AM
to Operator Framework
Hello

I'm currently creating a custom CRD using the Operator SDK (with Go). It works perfectly! Thanks for all the work done!

I just have a question and I'm unable to find any answers. Perhaps I'm just doing it wrong and need some insight into how to do it correctly!

When a user applies my custom CRD, the operator takes care of it. But in the case of an error (the script called by the operator returns an error), I want to put the complete error message in the "Condition > Message" property of my custom CRD status. After applying it, my user will know why the CRD is in an error state.

It works perfectly if the error message does not change between Reconcile calls. In my case, the third party I call returns error messages containing GUIDs and timestamps, so when I update my CRD, the reconcile loop is triggered multiple times per second instead of retrying with exponential backoff.

Currently I do not put the error inside the Condition > Message property, but this is what I want to achieve.

What is the best way to give the CRD's users a good error message? Currently, my users need to read the operator logs to get the real and complete error message.

Thanks for your help

Bryce Palmer

Aug 19, 2022, 11:21:58 AM
to Morgan, Operator Framework
Hi Morgan,

Using Go you should be able to get the error message as a string by using the error interface's `Error()` function (see: https://pkg.go.dev/builtin#error).

In your case I think it would be as simple as setting the message field in the `v1.Condition` you are creating like so:
```
Message: err.Error(),
```

I hope this helps!

Bryce Palmer

Software Engineer, Operator SDK

Red Hat

bpa...@redhat.com   



David Lanouette

Aug 19, 2022, 11:28:50 AM
to Morgan, Operator Framework
If I understand your question correctly, you are wondering how to not "reprocess" the request after you have set the error?

If that is the issue you are trying to solve, at the top of your Reconcile method, you can check if the condition has already been set.  If so, don't do any further work on it.


David Lanouette

Principal Software Engineer

David.L...@Redhat.com




Morgan Leroi

Aug 19, 2022, 12:04:48 PM
to David Lanouette, Bryce Palmer, Operator Framework
Thanks for your answers.

No, I don't want to reprocess it explicitly. Since my script returned an error, my reconcile loop also returns an error (https://github.com/morganleroi/deploy-website-k8s-operator/blob/3af50b8457db55b83aa92e035067b77ed5872606/controllers/webapp_controller.go#L107). Then my CRD is reprocessed with the correct exponential delay between retries.

What I want is just to include the complete error string returned by my script in the Message property of my CRD.

For example, if I set a "static" message like "An error happened", then it works perfectly.

But in my case, the error message returned by my script changes every time the script is called:

"Description=This request is not authorized to perform this operation using this permission.
RequestId:413151f0-501e-0180-3ad2-b3de7c000000
Time:2022-08-19T13:48:22.3019908Z, Details: (none)"

Because of that, I think my CRD is always modified, since I change the Message property again and again. As a result, the reconcile loop is called at a high frequency (multiple times per second), without the correct exponential retry.

What am I missing here?

David Lanouette

Aug 19, 2022, 12:18:23 PM
to Morgan Leroi, Bryce Palmer, Operator Framework
Your Reconcile method gets called every time the resource changes.  That includes when you set the Condition.  So, updating the Condition will cause the resource to get sent back to your controller.

To prevent an infinite loop, you can check if a Condition is already set in the resource.  If it has been set, do not update the resource again, just stop processing.  That should prevent it from calling your Reconcile method again.

I hope that makes sense.  If not, I can post some pseudo code.

David Lanouette

Principal Software Engineer

Bryce Palmer

Aug 19, 2022, 12:31:02 PM
to David Lanouette, Morgan Leroi, Operator Framework
My apologies for misunderstanding the question.

What David is mentioning is correct. Doing a check to see if the Condition is already set and returning if it is stops the reconcile loop.

To help visualize, my understanding of what David is mentioning would be to put a bit of code similar to this:
```
if len(webAppCrd.Status.Conditions) > 0 {
  if webAppCrd.Status.Conditions[len(webAppCrd.Status.Conditions) - 1].Status == v1.ConditionFalse {
    // stop the reconciliation because the latest status is that reconciliation failed
    return ctrl.Result{}, nil
  }
}
```

This would go somewhere around https://github.com/morganleroi/deploy-website-k8s-operator/blob/3af50b8457db55b83aa92e035067b77ed5872606/controllers/webapp_controller.go#L52 in order to return early and not continue executing the rest of the Reconcile function.

Morgan Leroi

Aug 19, 2022, 12:31:47 PM
to David Lanouette, Bryce Palmer, Operator Framework
Thanks David. Yes, it makes sense!

So, if I update my resource with a random value, then Reconcile is called directly, but the behaviour I expect (i.e. the exponential retry) is only triggered because my reconcile loop ends with a return containing an error value, right? These are two different behaviours for two different cases.

If I check whether the Condition is already set, how do I tell the difference between these two cases:
- My CRD is in an error state and I previously set a Condition. I do not want another Reconcile until users apply the CRD with new values. I want to skip the reconcile.
- My CRD is in an error state and I previously set a Condition, but my users applied a change to the CRD. I want to run the reconcile!

How am I supposed to tell the difference?

Thanks a lot for your help.

Alex Greene

Aug 19, 2022, 1:29:00 PM
to Morgan Leroi, David Lanouette, Bryce Palmer, Operator Framework
Hello Morgan,

Thanks for starting this discussion. To summarize, you are trying to:
- Process a CR on updates
- Communicate an error that includes timestamps through a condition in the CR's `status.Conditions` array.

I'm going to frame this response assuming that you agree with the following principle:
- When a controller processes a CR, it should be driven by the CR's spec and behavior must be deterministic.

Based on the conversation thus far, the same error is reached continuously as you update the status of the CR. There are two approaches that immediately jump to mind.

1. You can introduce a GenerationChangedPredicate, as suggested in this issue. This is a heavy-handed approach: your operator will only process spec changes, omitting changes to the metadata or status.

2. If option 1 is not acceptable, I encourage David's suggested approach, in which the CR's status is only updated when the error is initially encountered. The drawback to this approach is that the timestamp will only signify when the issue was first encountered, which should already be included in the condition's LastTransitionTime. If the condition has a unique type, status, and reason, the update logic can be based on those fields. In practice this would look like:
- The operator processes the CR and updates its status.Conditions array to include the error condition.
- The operator immediately processes the CR again. The error is encountered once more, but your controller does not update the message because the condition's type, status, and reason have not changed.
- A user modifies the spec of the CR, the issue is resolved, and the operator removes the condition with a status update.
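
A minimal sketch of the check described in option 2 might look like the following. To keep it runnable without Kubernetes imports, `Condition` is a local stand-in for `metav1.Condition`, and `needsUpdate`, the `Ready` type, the `ScriptFailed` reason, and the shortened message values are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Condition is a local stand-in for metav1.Condition, trimmed to the
// fields discussed here so the sketch runs without Kubernetes imports.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// needsUpdate reports whether next should be written to the status.
// Message is deliberately ignored, so a new RequestId or timestamp in
// the error text does not trigger another status update.
func needsUpdate(existing []Condition, next Condition) bool {
	for _, c := range existing {
		if c.Type == next.Type {
			return c.Status != next.Status || c.Reason != next.Reason
		}
	}
	return true // no condition of this type has been recorded yet
}

func main() {
	// Illustrative values only; the type and reason names are assumptions.
	recorded := []Condition{{
		Type: "Ready", Status: "False", Reason: "ScriptFailed",
		Message: "RequestId:first Time:first",
	}}
	retry := Condition{
		Type: "Ready", Status: "False", Reason: "ScriptFailed",
		Message: "RequestId:second Time:second",
	}
	fmt.Println(needsUpdate(recorded, retry)) // false
	retry.Status = "True"
	fmt.Println(needsUpdate(recorded, retry)) // true
}
```

With a check like this, the status keeps the message from the first failure, and later retries that hit the same error leave the resource untouched.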

If there is some subset of the message you are concerned about, you can add additional update logic based on the message for specific fields. For example, if you were concerned about the description field you had mentioned in your original email, you could create a structure that can be converted to and from a string; the comparison could then be done on the struct. This struct would basically look like:

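```golang
import "fmt"

// ConditionMessage is a structured form of the third-party error text.
type ConditionMessage struct {
	description string
	requestID   string
	time        string // or metav1.Time or something else
	details     string
}

// String renders the message for the condition's Message field.
func (cm ConditionMessage) String() string {
	return fmt.Sprintf("Description=%s\nRequestId:%s\nTime:%s, Details: %s",
		cm.description, cm.requestID, cm.time, cm.details)
}

// fromString parses a message back into a ConditionMessage.
func fromString(s string) ConditionMessage {
	// parsing logic goes here; it depends on the exact third-party format
	return ConditionMessage{}
}

// isDiff determines if a condition message has changed enough to warrant
// an update. As an example, requestID and time are ignored here because
// they change on every call.
func isDiff(a, b ConditionMessage) bool {
	return a.description != b.description || a.details != b.details
}
```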
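
If a full parse feels like too much, a lighter variant of the same comparison is to mask the volatile parts of the message before comparing. The sketch below is only an illustration: `stableMessage` and `sameError` are hypothetical helpers, the regular expression is an assumption based on the single sample message quoted earlier in this thread, and the second message uses made-up RequestId and Time values.

```go
package main

import (
	"fmt"
	"regexp"
)

// volatile matches the values that differ on every call. The pattern is
// an assumption based on the one sample message quoted in this thread.
var volatile = regexp.MustCompile(`(RequestId|Time):[^\s,]+`)

// stableMessage replaces each volatile value with a fixed placeholder.
func stableMessage(msg string) string {
	return volatile.ReplaceAllString(msg, "$1:<volatile>")
}

// sameError reports whether two messages describe the same failure once
// request IDs and timestamps are ignored.
func sameError(a, b string) bool {
	return stableMessage(a) == stableMessage(b)
}

func main() {
	first := "Description=Not authorized.\nRequestId:413151f0-501e-0180-3ad2-b3de7c000000\nTime:2022-08-19T13:48:22.3019908Z, Details: (none)"
	// second uses made-up RequestId and Time values for illustration.
	second := "Description=Not authorized.\nRequestId:00000000-0000-0000-0000-000000000001\nTime:2022-08-19T13:48:23.0000000Z, Details: (none)"

	fmt.Println(sameError(first, second))                 // true
	fmt.Println(sameError(first, "Description=Timeout.")) // false
}
```

The controller would keep storing the full, unmasked message in the condition so users still see the RequestId, and only use `sameError` to decide whether a new status update is needed.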

We can now look at the questions you proposed in your last message:
> So, if I update my ressource with random value, then the Reconcile is called directly, but the behaviour I expect (ie the exponential retry) is only triggered because my reconcile loop is ending with a return containing an error value, right ?
 
You are correct that returning an error within the reconcile function requeues the object, but in this case the object will still be requeued even if you return no error, because of the update to the CR's status. Updates to the CR status will requeue the CR unless you have configured your operator to only process spec changes via the GenerationChangedPredicate.
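
To make the effect of such a filter concrete, here is a small sketch of the idea behind it, not the controller-runtime implementation. `Object`, `UpdateEvent`, and `generationChanged` are hypothetical stand-ins. It assumes the CRD has the status subresource enabled, so that `metadata.generation` is not bumped by status-only updates.

```go
package main

import "fmt"

// Object is a stand-in for a custom resource; only the fields needed
// for the illustration are modelled.
type Object struct {
	Generation int64  // metadata.generation
	Message    string // status.conditions[].message
}

// UpdateEvent is a stand-in for the update event a controller receives.
type UpdateEvent struct {
	Old, New Object
}

// generationChanged lets an update through only when the generation
// differs between the old and the new object.
func generationChanged(e UpdateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

func main() {
	// A status-only update: the message changed, the generation did not.
	statusOnly := UpdateEvent{
		Old: Object{Generation: 3, Message: ""},
		New: Object{Generation: 3, Message: "error text with a new RequestId"},
	}
	// A spec edit by the user: the generation was incremented.
	specEdit := UpdateEvent{
		Old: Object{Generation: 3},
		New: Object{Generation: 4},
	}
	fmt.Println(generationChanged(statusOnly)) // false: event is dropped
	fmt.Println(generationChanged(specEdit))   // true: Reconcile runs
}
```

In a real operator, the equivalent filter is `predicate.GenerationChangedPredicate` from controller-runtime, typically registered on the controller builder with `WithEventFilter`. Since a predicate only filters watch events, returning an error from Reconcile should still requeue the object with backoff.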

> If I check if the Condition is already set, how do I do the difference between both cases :
> - My CRD is is error state, I previously set a Condition. I do not want another Reconcile until users apply the CRD with new values> I want to skip the reconcile.
> - My CRD is in error state, I previously set a Condition but my users applied a change is the CRD …I want to run the reconcile !

IMO - whenever your operator gets an object it should evaluate the desired state specified in the CR against the actual state of the cluster. That means you should ALWAYS process the event, but prevent subsequent requeues if no work was done, which aligns with the suggested solutions shared above.

I hope this helps.

Best,

Alex



--
Alexander Greene
He - Him - His
Senior Software Developer
IRC: agreene

Morgan Leroi

Aug 20, 2022, 5:06:35 AM
to Alex Greene, David Lanouette, Bryce Palmer, Operator Framework
Hello Alex

What a response! I did not expect such help from all of you! That's awesome.

I carefully read all the answers and my understanding of the behaviour is much clearer now, and I have plenty of options to test.

Again, thanks a lot.

I will reply in a few hours or days to describe the solution I selected and why. It could help someone else in the future!

Cheers

Morgan

Austin Macdonald

Aug 20, 2022, 11:19:16 AM
to Morgan Leroi, Alex Greene, David Lanouette, Bryce Palmer, Operator Framework
Are there any docs changes you can suggest that would have helped you? We love PRs ;)

Camila Macedo

Aug 23, 2022, 6:58:37 AM
to Austin Macdonald, Morgan Leroi, Alex Greene, David Lanouette, Bryce Palmer, Operator Framework
Hi Morgan:

Just to supplement: 

As described above, the best recommendation is to work with status conditions. This is currently described in the "common suggestions" doc, see: https://sdk.operatorframework.io/docs/best-practices/common-recommendation/ You can also find further information in: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

Also, note that in the next SDK release, 1.23, we will have a new plugin called Deploy Image (which was done as part of the Google Summer of Code 2022 program[0]). You can check its demo within Kubebuilder: https://www.youtube.com/watch?v=UwPuRjjnMjY

With this new optional plugin you are able to scaffold APIs/controllers to deploy and manage an Operand (image) on the cluster following the guidelines and best practices. It abstracts the complexities of achieving this goal while allowing you to customize the generated code. (More info[1])

You can check it out with SDK build from master branch:
$ operator-sdk init
$ operator-sdk create api --group example.com --version v1alpha1 --kind Memcached \
--image=memcached:1.6.15-alpine \
--image-container-command="memcached,-m=64,modern,-v" \
--image-container-port="11211" \
--run-as-user="1001" \
--plugins="deploy-image/v1-alpha"
On top of that, we have a WIP PR[2] to update the SDK Golang tutorial with this plugin, and you will see that it scaffolds the status using status conditions. So, you can check it out as an example of how to achieve this goal.

Until that PR is finished and merged, you can also check the example of how to use it in the testdata samples in kubebuilder, see:


Cheers, 

CAMILA MACEDO

SR. SOFTWARE ENGINEER 

RED HAT Operator framework

Red Hat UK

She / Her / Hers

IM: cmacedo

I respect your work-life balance. Therefore there is no need to answer this email out of your office hours.




