Ansible Operator Controller Incurring High CPU Penalty


Matt Peterson

May 4, 2021, 6:49:27 PM
to Operator Framework
Hello, 

I'm new to writing Operators and I apologize if this question was addressed before.  First of all, thanks for all of the hard work on the Operator SDK.  The framework enabled me to quickly write a robust implementation using the Ansible Operator (1.5.1) to manage one of our products.

I'm using the Ansible community.kubernetes.k8s module to start a headless Service and a StatefulSet to manage the product pods.  While the pods are starting up, I need to watch them to ensure they get configured properly via API calls from the controller.  So, I set up a watch on the pods in watches.yaml with a matchLabels directive to filter it down to just the pods I care about.  This approach works well for getting the pods into the correct state.  During reconciliation, I quickly exit the Ansible role if a pod is already configured to avoid additional processing cost.  However, pod events continue to stream into the Ansible role reconciliation code even when the pods are simply idling in the cluster.  I'm not sure why there are so many events, since nothing is changing the pod state and I've filtered the watch down with matchLabels.  As a result, the controller's CPU usage hovers around 900 millicpu per watched pod, scaling linearly with the number of pods.
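
A pod watch of the kind described here might look roughly like the following watches.yaml entry. This is only a sketch: the role name and the label are placeholders rather than the actual configuration, and writing the core API group as an empty string is an assumption.

```yaml
# watches.yaml (illustrative sketch; all names below are assumptions)
---
# Watch core/v1 Pods, narrowed by a label selector so that only the
# product's pods trigger the role.
- version: v1
  group: ""                # core API group, assumed to be written as an empty string
  kind: Pod
  role: configure_pod      # hypothetical role that configures each pod via API calls
  selector:
    matchLabels:
      app: my-product      # hypothetical label set in the StatefulSet's pod template
```

The selector only narrows which pods are watched; it does not reduce how often events arrive for the pods that do match.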

Ideally, I'd like to continue using the headless Service and StatefulSet to avoid the complexities of managing the pod lifecycle events in Ansible.  Is there a way to bring down the CPU usage in this context?  Thanks for the help.

Matt

Camila Macedo

May 5, 2021, 7:09:02 AM
to Matt Peterson, Operator Framework
Hi Matt, 

When an operator watches many resources, each reconcile can become expensive, and a low value for the reconcile period, for example, can actually reduce performance. Usually, you watch the resources that your project owns and trigger a reconcile when an event is raised, such as when a resource is created, updated, or deleted.

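So, I'd like to suggest that you check out the following documents to understand the watch feature and its options, and to see how you can better address your needs:

   - Ansible Operator Watches: https://sdk.operatorframework.io/docs/building-operators/ansible/reference/watches/
   - Dependent Watches: https://sdk.operatorframework.io/docs/building-operators/ansible/reference/dependent-watches/

After that, if you still require help, we will need further context about your scenario and requirements, such as what, why, and when you need to reconcile. Raising an issue in the repo and fully answering the template questions may be helpful in this case.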

I hope that it helps you out. 

Cheers, 

CAMILA MACEDO

SR. SOFTWARE ENGINEER 

RED HAT Operator framework

Red Hat UK

She / Her / Hers

IM: cmacedo






Daniel Messer

May 5, 2021, 9:48:03 AM
to Camila Macedo, Matt Peterson, Operator Framework
Maybe I read this incorrectly but it seems what Matt is seeing is that, with a watch on Pods, there are a lot of events generated by Kubernetes about those Pods, even when they are not changing in state ("idling"). So it's not about watching too much but "being called too often", even though nothing changed. Is that what you mean?



--
Daniel Messer

Product Manager Operator Framework & Quay

Red Hat OpenShift

Camila Macedo

May 5, 2021, 10:56:57 AM
to Daniel Messer, Matt Peterson, Operator Framework
Hi Matt, 

Could you please share the watches.yaml file as well, so that we have a better idea of the configuration you made? Also, could you let us know whether you are using the `reconcilePeriod` option?

jesus m. rodriguez

May 5, 2021, 12:21:36 PM
to Camila Macedo, Daniel Messer, Matt Peterson, Operator Framework
Matt,

We've seen cases where dependent items cause events to continue coming in. Recently a
PR [1] was merged to help identify situations like yours where we can see what is
causing the changes.

Also, there was a PR [2] that fixed an issue with the predicate to ignore managed
fields.

[1] https://github.com/operator-framework/operator-sdk/pull/4779
[2] https://github.com/operator-framework/operator-lib/pull/59

These might help identify why the Ansible operator is using high cpu.

Sincerely,
jesus

--
jesus m. rodriguez          | jes...@redhat.com
principal software engineer | irc: zeus
red hat operator sdk    | 919.754.4413 (w)
rhce # 805008586930012      | 919.623.0080 (c)
+---------------------------------------------+
|   "Those who cannot remember the past       |
|    are condemned to repeat it."             |
|                        -- George Santayana  |
+---------------------------------------------+




Matt Peterson

May 6, 2021, 10:38:20 AM
to Operator Framework
Hi Camila,

Thanks for taking a look at my issue.  I've tried dialing back the reconcilePeriod to 2 minutes, but I still get a steady stream of calls to my Ansible code.  I've tried boosting the logging to see what's triggering these calls, given that the pods are stable and running (no changes to their lifecycle).  I'll look through the links you provided to see if I missed something.

Thanks,


Matt Peterson

May 6, 2021, 10:50:31 AM
to Operator Framework
Hi Daniel,

Yes, that's what I meant.  The pods created by the StatefulSet are up and in a running state, but my Ansible code gets called continuously even though nothing is changing with them.  This is in contrast to the dependent resources I created directly (the StatefulSet and the headless Service), which seem to call my Ansible code only when something changes.

Thanks,

Matt Peterson

May 6, 2021, 10:59:24 AM
to Operator Framework
Hi Jesus,

Thanks for the github links.  Yeah, this could be what I'm seeing: "We've had users reporting infinite reconciliation and this is likely the culprit."  It looks like these fixes went in recently.  Are they targeted for v1.7.2 or v1.8.0? 

jesus m. rodriguez

May 6, 2021, 12:16:22 PM
to Matt Peterson, Operator Framework
Matt,

The PR for logging events is in 1.7.1 already. This can be used to get insight into
what might be causing the reconciles.

The PR against operator-lib has not yet been released and will likely land in 1.8.0.
We might be able to backport it to 1.7.3 (since 1.7.2 is going out today).

Sincerely,
jesus


Camila Macedo

May 6, 2021, 2:02:07 PM
to Jesus Rodriguez, Matt Peterson, Operator Framework
Hi Matt, 


  • reconcilePeriod (optional): The maximum interval that the operator will wait before beginning another reconcile, even if no watched events are received. When an operator watches many resources, each reconcile can become expensive, and a low value here can actually reduce performance. Typically, this option should only be used in advanced use cases where watchDependentResources is set to False and when it is not possible to use the watch feature, e.g. to manage external resources that don’t emit Kubernetes events. The format for the duration string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as “300ms”, “1.5h” or “2h45m”. Valid time units are “ns”, “us” (or “µs”), “ms”, “s”, “m”, “h”.
So, what do you think about checking the links shared to see whether you can use the watches feature to track these pods? Are these pods created by your operator or not? Why do you need to set the reconcilePeriod?
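
To make the option above concrete, here is a hedged sketch of the one case it is described as suited for: a custom resource that manages something off-cluster. The group, kind, role name, and the 5m interval are all hypothetical and chosen only for illustration.

```yaml
# watches.yaml (illustrative sketch; group, kind, and role are hypothetical)
---
- version: v1alpha1
  group: example.com
  kind: ExternalDatabase
  role: external_database
  # The managed resource lives off-cluster and emits no Kubernetes events,
  # so a periodic reconcile is used to notice drift.
  reconcilePeriod: 5m
  # Nothing on-cluster needs to be tracked for this resource.
  watchDependentResources: False
```

The duration uses the string format described above, so a value such as "2h45m" would be written the same way.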

I hope that helps you. 

Cheers, 

Fabian von Feilitzsch

May 6, 2021, 2:10:35 PM
to Matt Peterson, Operator Framework
On Thu, May 6, 2021 at 10:38 AM 'Matt Peterson' via Operator Framework <operator-...@googlegroups.com> wrote:
> Hi Camila,
>
> Thanks for taking a look at my issue.  I've tried dialing back the reconcilePeriod to 2 mins but I still get a steady stream of calls to my ansible code.  I've tried boosting the logging to see what's triggering these calls given that the pods are stable and running (no changes to their lifecycle).  I'll look through the links you provided to see if I missed something.


Unless you need to regularly reconcile off-cluster resources that don't generate Kubernetes events, I'd recommend avoiding reconcilePeriod entirely, especially if you'll be reconciling a large number of resources. By nature it will lead to a lot of useless reconciliations and can easily lock up your operator. Watching dependent resources should be adequate for handling any on-cluster resources that generate events.
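
As a rough illustration of that advice, a watch that relies only on events from the custom resource and its dependents might be written as below. The group, kind, and role name are hypothetical, and watchDependentResources is spelled out only for emphasis, on the assumption that it is already enabled by default.

```yaml
# watches.yaml (illustrative sketch; group, kind, and role are hypothetical)
---
- version: v1alpha1
  group: example.com
  kind: MyProduct
  role: my_product
  # Reconcile when resources created by the role (for example the
  # StatefulSet and headless Service) change.
  watchDependentResources: True
  # reconcilePeriod is deliberately left out, so no periodic reconcile
  # is requested on top of event-driven ones.
```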

[sorry for the double reply matt, forgot to send it to the list]

jesus m. rodriguez

May 7, 2021, 5:32:16 PM
to Matt Peterson, Operator Framework
Matt,

I wanted to let you know that the fix I mentioned below actually went out with v1.7.2
[1] today. It was also backported to v1.6.4 [2]. Let us know if you have any issues
with it.

[1] https://github.com/operator-framework/operator-sdk/releases/tag/v1.7.2
[2] https://github.com/operator-framework/operator-sdk/releases/tag/v1.6.4

Sincerely,
jesus

Matt Peterson

May 10, 2021, 12:40:18 PM
to Operator Framework
Thanks for the heads up.  I'll try out the fix in those versions.  Thanks everyone for the discussions and help!

Matt
