Changing runtime configuration of ovnkube daemons using per-node configmap??


Girish Moodalbail

Mar 19, 2020, 3:05:58 AM
to Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
Hello all,

I am exploring adding support for changing the runtime configuration of the ovnkube-master and ovnkube-node daemons without having to restart them.

The first thing I need to do for this is, instead of deriving all of ovnkube's CLI options from the environment variables in ovnkube.sh, capture all the options/parameters in an ovn-k8s.conf file and use that file. The file itself will come from the deployer/operator as a ConfigMap entry ('ovn-k8s.conf = <fileContents>'), and this entry will be mounted as a volume inside the ovnkube-{master|node} containers at a well-known path.
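As a sketch of what this could look like (the ConfigMap name, file contents, and mount path below are assumptions, not settled decisions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config            # hypothetical name
  namespace: ovn-kubernetes
data:
  ovn-k8s.conf: |
    [default]
    mtu=1400
    [logging]
    loglevel=4
---
# In the ovnkube-{master|node} pod template, roughly:
#   volumes:
#   - name: config
#     configMap:
#       name: ovnkube-config
#   containers[].volumeMounts:
#   - name: config
#     mountPath: /etc/ovn-kubernetes    # the "well-known path"
```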

The ovn-k8s codebase already has a watcher that watches for changes to this file; currently it kills the daemon whenever the file changes. Instead of blindly exiting, the plan is to check what changed and see whether those changed parameters can be applied at runtime without killing the daemon.
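The "check what changed" step could be as simple as diffing the old and newly-loaded key/value sets and consulting an allow-list of runtime-applicable keys. A minimal sketch in Go (the key names and the allow-list contents here are my assumptions, not actual ovn-k8s identifiers):

```go
package main

import "fmt"

// dynamicKeys lists config keys (hypothetical names) that we assume can be
// re-applied at runtime without restarting the daemon.
var dynamicKeys = map[string]bool{
	"logging.loglevel":         true,
	"default.inactivity-probe": true,
}

// diffConfig returns the keys whose values differ between the previously
// loaded configuration and the newly re-read one.
func diffConfig(old, cur map[string]string) map[string]string {
	changed := map[string]string{}
	for k, v := range cur {
		if old[k] != v {
			changed[k] = v
		}
	}
	for k := range old {
		if _, ok := cur[k]; !ok {
			changed[k] = "" // key was removed
		}
	}
	return changed
}

// needsRestart reports whether any changed key is not runtime-applicable.
func needsRestart(changed map[string]string) bool {
	for k := range changed {
		if !dynamicKeys[k] {
			return true
		}
	}
	return false
}

func main() {
	old := map[string]string{"logging.loglevel": "4", "default.mtu": "1400"}
	cur := map[string]string{"logging.loglevel": "5", "default.mtu": "1400"}
	changed := diffConfig(old, cur)
	fmt.Println(changed)               // map[logging.loglevel:5]
	fmt.Println(needsRestart(changed)) // false: only a dynamic key changed
}
```

The file watcher would call diffConfig on each change event, apply the dynamic keys in place, and fall back to the current exit-and-restart behavior only when needsRestart returns true.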

We have two daemons - ovnkube-master and ovnkube-node. The current ovnkube CLI options apply to the master, to the node, or to both. To keep things simple, we will capture all the options in the file and rely on the master and node daemons doing the right thing. This file will be common to all the masters and nodes in the K8s cluster.

Now, if the deployer wants to change the config of an individual node, say set debugging level to 5, then they will have to create a per-node ConfigMap (namespace=ovn-kubernetes and name=node_hostname).

This is where I got stuck.

1. The ovnkube-node is a DaemonSet. So, how do I specify a unique ConfigMap for each node in the cluster? I cannot specify a Volume with a variable name like the one below, right?

volumes:
- name: node-config
  configMap:
    name: K8S_NODE_NAME
    optional: true

2. Say I am able to do that. The pod will start even if the ConfigMap is missing at boot time, since it is optional; the VolumeMount is completely skipped. Now, if I have to change something for that node and I create the ConfigMap, I don't think the data will get mounted as a file, since the pod is already running and has skipped the VolumeMount.

3. If I have a 600-node cluster, I cannot create 600 ConfigMaps in anticipation of needing them in the future to update some runtime config.

4. The other option is to create `one` ConfigMap with several keys, where each `key` is a hostname, and mount this ConfigMap in both ovnkube-node and ovnkube-master. The issue here is that 'all' 600 ovnkube-node containers will see the other nodes' configuration files.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config
  namespace: ovn-kubernetes
data:
  global: |
    [logging]
    loglevel=4
    <snipped>
  node1: |
    [logging]
    loglevel=4
    <snipped>
  node2: |
    [logging]
    loglevel=7
    <snipped>
  node3: |
    [logging]
    loglevel=1


On node1, at the mount path, say /var/lib/ovn, you will see /var/lib/ovn/node1, /var/lib/ovn/node2, and so on. I am not sure it is desirable to have that?
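If the single-ConfigMap approach were pursued anyway, one way to avoid exposing every node's file to every pod might be the `subPathExpr` volume-mount field (GA in Kubernetes 1.17), which expands an environment variable at mount time. A sketch, assuming the ConfigMap layout above:

```yaml
containers:
- name: ovnkube-node
  env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  volumeMounts:
  - name: node-config
    mountPath: /etc/ovn-kubernetes/node.conf
    subPathExpr: $(K8S_NODE_NAME)     # mounts only this node's key
volumes:
- name: node-config
  configMap:
    name: ovnkube-config
    optional: true
```

The catch is that files mounted via subPath/subPathExpr do not receive live updates when the ConfigMap changes, which works against the runtime-reconfiguration goal here.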

Any thoughts or comments? Is ConfigMap the right approach if we need per-node changes?

Thanks,
~Girish





Girish Moodalbail

Mar 23, 2020, 5:06:19 PM
to Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
Hello folks,

Any thoughts on this?

Changing things at runtime will become very crucial at scale. One of the things we are noticing is that the default RAFT election_timer of 1s is too low at scale. There are frequent elections: the SB DB server gets too busy and doesn't send out the heartbeat, the others assume the leader is gone, and an election starts. So, based on the scale, we need to adjust this timer: start at 1s and increase it as we scale. There are a lot of other use cases that require changing other settings at runtime.
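For reference, ovsdb-server's raft implementation already exposes a runtime knob for this via unixctl; something along these lines (the ctl socket path varies by packaging, and I believe the server only lets you roughly double the timer per invocation, so large increases take several steps):

```sh
# On the current SB DB raft leader; the new value is in milliseconds.
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
```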

Regards,
~Girish

Dan Winship

Mar 23, 2020, 8:22:51 PM
to Girish Moodalbail, Dan Williams, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
On 3/23/20 5:06 PM, Girish Moodalbail wrote:
> Hello folks,
>
> Any thoughts on this?
>
> Changing things at runtime will become very crucial at scale. One of the things we are noticing is that the default
> election_timer of 1s for the RAFT is too low at scale. There are frequent elections since the SB DB server gets too busy and doesn't send out the heartbeat and others assume the leader is gone and the election starts. So, based on the scale we need to adjust this timer. Start at 1s and increase it as we scale. There are lot of other use cases that requires one to change other settings at runtime.

But that doesn't require a *per-node* change. You'd be changing the
configuration for all the masters.

> The ovn-k8s codebase already has a watcher that watches for changes to this file and currently kills the daemon on changes to the file. Instead of blindly exiting, the plan was to check what changed and see if those changed parameters can be applied at runtime without killing the daemon.

It seems to me that if you have that then you don't really need to worry
about having per-node configmaps.

> 4. The other option is to create `one` ConfigMap and have several keys where each `key` is a hostname and mount this configmap in both ovnkube-node and ovnkube-master. The issue here is that 'all' the 600 ovnkube-node containers will have other node's configuration file.

What do you need to configure per-node exactly? Your example just shows
loglevel, but you're not really going to need to configure specific log
levels for every node.

-- Dan

Aniket Bhat

Mar 23, 2020, 10:31:55 PM
to Girish Moodalbail, Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
@Girish Moodalbail

Kubernetes itself handles per-node kubelet configuration via Dynamic Kubelet Configuration (node.Spec.ConfigSource). We can potentially follow a similar pattern, i.e. in the Node object, create a unique ConfigMap reference and use that object to reconfigure the ovnkube-* pods.

Another option is to write the name and namespace of the ConfigMap as an annotation and use that. We don't necessarily need to volume-mount it to consume it, right?

Once we see the annotation on the node (or the node.Spec.ConfigSource change), can we not fetch the ConfigMap for its specific contents? Yes, it's an extra GET, but hopefully the config doesn't change frequently enough for the extra watch to be a bottleneck.

Thanks,
Aniket.


Aniket Bhat

Mar 24, 2020, 7:43:28 AM
to Girish Moodalbail, Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
I spoke to Clayton Coleman about alternatives, and he suggested creating an OVNNodeConfig CRD that the ovnkube-* processes reconcile. I think that is the most idiomatic way to go about it.
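Purely as an illustration of the shape such a CRD could take (every name and field below is hypothetical, not what was actually proposed):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: ovnnodeconfigs.k8s.ovn.org
spec:
  group: k8s.ovn.org
  scope: Cluster
  names:
    kind: OVNNodeConfig
    plural: ovnnodeconfigs
    singular: ovnnodeconfig
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              nodeSelector:             # which nodes this config applies to
                type: object
                additionalProperties:
                  type: string
              logLevel:
                type: integer
              gateway:
                type: object
                properties:
                  nextHop:
                    type: string
                  vlanID:
                    type: integer
```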

Girish Moodalbail

Mar 24, 2020, 2:47:25 PM
to Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
Hello Dan/Aniket,

Thank you for chiming in. Will reply to your emails separately. Hopefully, the email below should answer some of the questions you have asked.

The following table captures all the configuration parameters that we have in ovn-k8s. Based on the subsystem they target, we have grouped them into several buckets -- Kubernetes, CNI, Logging, Gateway, and so on. (Note: I have skipped some config parameters that are not part of the ovn-k8s configuration file.)

These are my definitions of `scope`:
  OVN Network  -> applies to the OVN logical topology that represents the primary interface of the pod.
                  (When we have support for multiple OVN interfaces per pod, these parameters will be
                  different for each interface and will be captured in the respective
                  network-attachment-definition CRD.)
  GlobalMaster -> these parameters are the same for all ovnkube-master pods in the cluster.
  GlobalNode   -> these parameters are the same for all ovnkube-node pods in the cluster.
  Per-Node     -> these parameters "can" be different for every node in the cluster (see examples after
                  the table). (For simple deployments they can be the same for all nodes, but for
                  complex deployments they will differ.)
  Global       -> these parameters are the same for all ovnkube-master and ovnkube-node pods in the cluster.

These are my definitions of `modifiability`:
  Static  -> a deployment-time setting whose value never changes. If you have to change the value, you
             need to delete all the ovn-k8s YAMLs, change the values, and re-deploy.
  Dynamic -> the setting can be changed at runtime. As your cluster scales over time and you see issues
             at scale, you want to change the values at `runtime` without `daemon` restarts, if possible.

|------------+------------------------+--------------+---------------+-----------------------|
| section | config | scope (w.r.t | modifiability | comments |
| | | cluster) | | |
|------------+------------------------+--------------+---------------+-----------------------|
| Default | MTU | OVN Network | Static | |
| | Conntrack Zone | OVN Network | Static | |
| | EncapType | GlobalMaster | Static | |
| | EncapIP | Per-Node | Static | |
| | EncapPort | GlobalMaster | Static | |
| | OVNInactivityProbe | GlobalNode | Dynamic | ovn-controller's |
| | OpenFlowProbe | GlobalNode | Dynamic | ovn-controller's |
| | ClusterSubnets | OVN Network | Static | |
| | MetricsBindAddress | GlobalMaster | Static | for master |
| | MetricsBindAddress | GlobalNode | Static | for node |
| | MetricsEnablePprof | Global | Static | |
| | | | | |
| | | | | |
| Kubernetes | Kubeconfig             | Global       | Static        | We don't need any     |
| | APIServer | Global | Static | of these in ovn-k8s |
| | CACert | Global | Static | daemonset deployments |
| | Token | Global | Static | if we use |
| | | | | InClusterConfig() |
| | | | | |
| | ServiceCIDR | Global | Static | |
| | OVNConfigNamespace | Global | Static | |
| | OVNEmptyLbEvents | GlobalMaster | Dynamic? | |
| | RawNoHostSubnetNodes | GlobalMaster | Static | |
| | | | | |
| | | | | |
| Logging | LogFile | Global | Static | |
| | Level | Per-Node | Dynamic | |
| | | | | |
| CNI | ConfDir | GlobalNode | Static | |
| | Plugin | GlobalNode | Static | |
| | | | | |
| Gateway    | Mode                   | Per-Node     | Dynamic       |                       |
|            | Interface              | Per-Node     | Dynamic       |                       |
|            | NextHop                | Per-Node     | Dynamic       |                       |
|            | VLANID                 | Per-Node     | Dynamic       |                       |
|            | NodeportEnable         | Per-Node     | Dynamic       |                       |
| | | | | |
| MasterHA | ElectionLeaseDuration | GlobalMaster | Dynamic | |
| | ElectionRenewDeadline | GlobalMaster | Dynamic | |
| | ElectionRetryPeriod | GlobalMaster | Dynamic | |
| | | | | |
| OVNDB | NBPort | Global | Static | |
| | SBPort | Global | Static | |
| | NBRaftPort | Global | Static | |
| | SBRaftPort | Global | Static | |
| | NBInactivityProbe | Global | Dynamic | |
| | SBInactivityProbe | Global | Dynamic | |
|            | NorthdInactivityProbe  | Global       | Dynamic       |                       |
| | ClusterElectionTimer | Global | Dynamic | |
| | | | | |
|------------+------------------------+--------------+---------------+-----------------------|


Examples:

1. Per-node loglevel:
In a 600-node cluster, we need to debug a single misbehaving ovnkube-node daemon. To debug the issue, we need a way to change the klog level to 5 on that node's daemon alone, and not on all of the nodes in the cluster.

2. Per-node gateway settings:
Imagine a DC with routed IP networks. The L2 networks do not span the physical network core; L2 terminates at the top-of-rack switch, and everything else is L3 in the core. If a rack consists of 10 physical nodes and our cluster size is 100 nodes, then we will have 10 racks. Each rack will have its own L2 underlay, so the gateway next hop and VLAN ID will be different for every rack.

3. NodeportEnable settings:
In a 100-node cluster, we don't want to enable NodePorts on 80 of the nodes. We want LB/Ingress traffic to go through only the remaining 20 nodes and nothing else.

4. Per-node EncapIP settings:
This is similar to point 2 above. Imagine a single L2 domain across all 640 nodes in 64 racks in a DC, and the amount of L2 traffic those 640 nodes would generate on that underlay. So, an L2/L3 boundary will be present in large datacenters.
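To make examples 2 and 4 concrete: a per-rack configuration could differ from the global one in only a handful of keys. A sketch (section and key names follow the spirit of ovn-k8s.conf, but the exact spellings here are assumptions):

```ini
# rack7.conf -- shared by the 10 nodes in rack 7 (hypothetical)
[default]
encap-ip=10.7.0.21        # in practice per-node, not per-rack

[gateway]
mode=shared
interface=eth1
next-hop=10.7.0.1         # rack 7's ToR router
vlan-id=107
nodeport=false
```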


Regards,
~Girish



Girish Moodalbail

Mar 24, 2020, 2:59:56 PM
to Dan Winship, Dan Williams, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com


On 3/23/20, 5:22 PM, "Dan Winship" <da...@redhat.com> wrote:



On 3/23/20 5:06 PM, Girish Moodalbail wrote:
> Hello folks,
>
> Any thoughts on this?
>
> Changing things at runtime will become very crucial at scale. One of the things we are noticing is that the default
> election_timer of 1s for the RAFT is too low at scale. There are frequent elections since the SB DB server gets too busy and doesn't send out the heartbeat and others assume the leader is gone and the election starts. So, based on the scale we need to adjust this timer. Start at 1s and increase it as we scale. There are lot of other use cases that requires one to change other settings at runtime.

> But that doesn't require a *per-node* change. You'd be changing the
> configuration for all the masters.

Right. That was not a correct example. See my other thread on the various configuration parameters that can be changed on a per-node basis.


> The ovn-k8s codebase already has a watcher that watches for changes to this file and currently kills the daemon on changes to the file. Instead of blindly exiting, the plan was to check what changed and see if those changed parameters can be applied at runtime without killing the daemon.

> It seems to me that if you have that then you don't really need to worry
> about having per-node configmaps.

But the changes to the ConfigMap will affect all the nodes in the cluster.


> 4. The other option is to create `one` ConfigMap and have several keys where each `key` is a hostname and mount this configmap in both ovnkube-node and ovnkube-master. The issue here is that 'all' the 600 ovnkube-node containers will have other node's configuration file.

> What do you need to configure per-node exactly? Your example just shows
> loglevel, but you're not really going to need to configure specific log
> levels for every node.

Please see my other email.

Phil Cameron

Mar 24, 2020, 4:13:13 PM
to ovn-kub...@googlegroups.com
On OpenShift, the cluster-network-operator makes sure the configuration, including the ConfigMaps, doesn't change; when it finds a change, it rolls it back. Clearly this could be changed. However, ...

The cluster should be able to figure out the scale internally and make adjustments to keep running properly. Why should admins get involved? Do we expect the admins who do get involved and follow the docs to get it right? How many support calls will we get on this?

Just a thought.

Girish Moodalbail

Mar 24, 2020, 4:24:31 PM
to Aniket Bhat, Dan Williams, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com

Hello Aniket,

Thanks for the pointer to the dynamic kubelet configuration. I went through that link, and IMHO that pattern applies very well to ovn-k8s's needs.

We could very well take ovn-k8s.conf, wrap it in a new OVNNodeConfig CRD, and make every ovnkube-node and ovnkube-master reconcile on it. However, I am not sure we need a CRD mechanism for the simple requirement we have, especially given that there is a precedent in k8s core for achieving the same thing with a ConfigMap.

I don't think we can use the node.Spec.ConfigSource field itself, as it is set aside for `kubelet`. Instead, we could use a separate node annotation under the `k8s.ovn.org` namespace. The good thing about this pattern is that the same ConfigMap could be shared by several nodes, by referencing the same ConfigMap name on all of the nodes that require the same configuration (for example, all the nodes in one rack could share the same gateway configuration).
The workflow will look something like this:

  1. We consolidate all the ovn-k8s configuration parameters into one file.
  2. To begin with, the ovnkube-masters and ovnkube-nodes share the same global ovn-k8s.conf (wrapped in a ConfigMap).
  3. The ovnkube container's entrypoint script, ovnkube.sh, reads the ConfigMap contents, writes them to a file in /etc/ovn-kubernetes, and starts the daemons with a pointer to this file.
  4. The ovnkube daemons watch for node changes; if k8s.ovn.org/configSource is set or modified, they act upon it to reflect those changes.

(There are more steps missing above -- provide users a way to retrieve the current running configuration of the daemon so that they can modify things on top of it, and so on.)
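In step 4, the annotation might be set along these lines (the annotation key and value format are assumptions at this stage):

```sh
kubectl create configmap rack7-config -n ovn-kubernetes --from-file=ovn-k8s.conf
kubectl annotate node node1 \
    k8s.ovn.org/configSource='{"namespace":"ovn-kubernetes","name":"rack7-config"}'
```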

 

What do others think? Should we talk more about this during tomorrow's community meeting?

Regards,
~Girish

Dan Williams

Mar 25, 2020, 5:13:31 PM
to Girish Moodalbail, Dan Winship, c...@redhat.com, Jacob Tanenbaum, ovn-kub...@googlegroups.com
On Tue, 2020-03-24 at 18:47 +0000, Girish Moodalbail wrote:
> Hello Dan/Aniket,
>
> Thank you for chiming in. Will reply to your emails separately.
> Hopefully, the email below should answer some of the questions you
> have asked.

So we settled on custom resources. I hacked up a CRD (not validated at
all) that shows what I was thinking of. Attached.

Dan
001-crd.yaml