Job API Feature Requests for Flux Framework Operator


v

Nov 11, 2022, 11:19:45 AM
to wg-b...@kubernetes.io
Hi Batch Working Group!

I saw Aldo's talk at KubeCon about Job API features and also recently joined the Slack, and wanted to start some discussion about features of interest! I figure we can start here, and then possibly open up issue(s) on GitHub. For some background, I'm working on the Flux Operator, and we are trying to deploy a Flux Framework "mini cluster" using an indexed job. Some quick notes about the design:

- A "Mini cluster" CRD originally set up the nodes and launched the job, and now it just sets up the nodes, and starts a RESTful API to submit jobs to (WIP).
- One Mini Cluster runs one Flux Framework instance and is owned by one user.
- Each pod "node" is networked through a hack to populate the /etc/hosts of each pod, and includes other shared configs and assets (through volumes) for the nodes.
- There is one "main" broker that starts the RESTFul API via flux, the others start Flux and need to be discovered by the main broker. If there are N nodes total, they all ways to see N ip addresses populated in /etc/hosts.
- As stated above, we use an indexed job.
- In case it matters, I use minikube to develop (the emojis give me life!!) 😆🎉❤🦄

Some features I think we'd like:

- An ability to define multiple pod templates with a worker / driver pattern within the same Indexed Job to allow for a startup sequence. Right now I have one wait.sh script that basically says "If I'm index 0, start the server, otherwise just start Flux and expect to be discovered" (see the sketch after this list). I suspect others have scripts like this that are essentially if->else noodles. It's a bit messy and I expect to see issues when we try to scale it.
- This would be hugely helped by some ability to look at the states of the different pods in the Job. E.g., I would have the workers boot up first, ensure they are all ready, and then I could start the server that expects them to be there. If a worker fails to start, I can re-create it without worry.
- An alternative to the above (if we cannot have a worker / driver pattern) would be to allow creation of some N pods first, and then an incremental addition of pods (with an ability to check when the first set is ready).
- And finally, more plug and play ability to have a networked set of pods in the Mini Cluster. I've looked at other operators (e.g., the mpi-operator) and they all have "tricks" to do this. It would be super nice to just have an option that will automatically allow the pods to see one another (a basic ping should work).
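For illustration, here is a minimal Go sketch of the branching on the completion index mentioned in the first item (the real wait.sh is a shell script, and the start scripts named below are hypothetical placeholders, not the Flux Operator's actual commands). Indexed Jobs expose each pod's index through the JOB_COMPLETION_INDEX environment variable.

package main

import (
	"os"
	"os/exec"
)

func main() {
	// JOB_COMPLETION_INDEX is set by Kubernetes for pods of an Indexed Job.
	index := os.Getenv("JOB_COMPLETION_INDEX")

	var cmd *exec.Cmd
	if index == "0" {
		// The "main" broker: start Flux plus the RESTful API server.
		cmd = exec.Command("/bin/bash", "/flux_operator/start_server.sh") // hypothetical path
	} else {
		// Worker brokers: just start Flux and wait to be discovered.
		cmd = exec.Command("/bin/bash", "/flux_operator/start_worker.sh") // hypothetical path
	}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}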

Those are the main features that would be really nice to have! I have a few more ideas but I want to keep this email shorter. Thanks y'all for the great discussion, and for working on these APIs! I'm new to developing operators for Kubernetes and I'm loving it.

Best,

Vanessa

Aldo Culquicondor

Nov 11, 2022, 4:36:29 PM
to v, Tim Hockin, Sergey Kanzhelev, wg-b...@kubernetes.io
Thank you for the list, Vanessa.

I guess the general question is around the startup sequence of pods. This topic has been discussed multiple times. Two threads that come to mind:

I think the general stance is that containers should be resilient to starting in a different order. Still, this is something we can discuss in a WG meeting, but I would like to have @Tim Hockin and @Sergey Kanzhelev present :)

>  And finally, more plug and play ability to have a networked set of pods in the Mini Cluster. 

This is already supported. You can just add a headless service to the set of Pods, assuming that each has a different hostname. The v2 mpi-operator does this: https://github.com/kubeflow/mpi-operator/tree/master/v2.
And you can always add a headless Service to an Indexed Job (briefly mentioned here). Each pod has a hostname that is derived from its index. Not sure if this was your question. You still need to check from the pods whether the other pods are reachable.
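For illustration, a minimal client-go sketch of that wiring (the namespace, names, image, and command below are placeholders for this example, not values from the Flux Operator):

package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	namespace := "flux-operator"
	name := "flux-sample"
	completions := int32(4)
	mode := batchv1.IndexedCompletion

	// Headless Service (ClusterIP "None") selecting the Job's pods; the Job
	// controller labels its pods with job-name=<job name>.
	service := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"job-name": name},
		},
	}

	// Indexed Job: each pod's hostname is <job name>-<index>, and Subdomain
	// points it at the headless Service above.
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			Completions:    &completions,
			Parallelism:    &completions,
			CompletionMode: &mode,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Subdomain:     name,
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "flux",
						Image:   "ubuntu:22.04", // placeholder image
						Command: []string{"sleep", "infinity"},
					}},
				},
			},
		},
	}

	ctx := context.Background()
	if _, err := clientset.CoreV1().Services(namespace).Create(ctx, service, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if _, err := clientset.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

With that in place, each pod should be addressable as flux-sample-<index>.flux-sample.flux-operator.svc through the cluster DNS.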

Aldo



Kevin Hannon

Nov 15, 2022, 9:10:59 AM
to Aldo Culquicondor, Sergey Kanzhelev, Tim Hockin, v, wg-b...@kubernetes.io

Hello,

I was reading some of the docs in the announcement for the SIG Node sidecar working group. I saw a similar idea in their reading docs about adding DAG-like syntax for multiple containers. I'm afraid I will miss their initial meeting. Not sure what's in scope for that at this point though.

Tim Hockin

Nov 15, 2022, 10:58:40 AM
to Kevin Hannon, Aldo Culquicondor, Sergey Kanzhelev, v, wg-b...@kubernetes.io
I am happy to discuss intra-pod sequencing - I mostly don't buy it, but I am very open to arguments why I am wrong. And now is about as good a time as we can get.

Aldo Culquicondor

Nov 15, 2022, 11:19:07 AM
to Tim Hockin, Kevin Hannon, Sergey Kanzhelev, v, wg-b...@kubernetes.io
Intra-pod sequencing is a related discussion, but this thread is actually about inter-pod sequencing within a Job.

Aldo

Aldo Culquicondor

Nov 15, 2022, 12:57:13 PM
to Tim Hockin, Kevin Hannon, Sergey Kanzhelev, v, wg-b...@kubernetes.io
Only one of them, as a related topic.
Aldo


On Tue, Nov 15, 2022 at 12:35 PM Tim Hockin <tho...@google.com> wrote:
> The two links above are all about intra-pod sequencing ("keystone")?

Tim Hockin

Nov 15, 2022, 1:36:09 PM
to Aldo Culquicondor, Kevin Hannon, Sergey Kanzhelev, v, wg-b...@kubernetes.io
The two links above are all about intra-pod sequencing ("keystone")?

On Tue, Nov 15, 2022 at 8:19 AM Aldo Culquicondor <aco...@google.com> wrote:
>

v

Nov 15, 2022, 8:17:19 PM
to Aldo Culquicondor, Tim Hockin, Sergey Kanzhelev, wg-b...@kubernetes.io
Hi Aldo and batch!

Thank you again for the links! I've taken a closer look and I have some comments for discussion.

Inter-Pod Networking within the Job

>  And finally, more plug and play ability to have a networked set of pods in the Mini Cluster.

You noted that the v2 mpi-operator has a headless service for a set of pods, but this isn't exactly what we want. We want the pods to be networked together and all see one another, not just to be able to interact with a headless service. If you look into the mpi-operator v2 code, they actually use the same strategy that we took - there is a discovery shell script (written as a ConfigMap for the pods) that updates from a known listing of pods, and I assume that is how the hostfileName is populated (in our case we populate /etc/hosts, same strategy).

But (different topic!) with respect to a headless service, we actually do that too! Since the Flux Mini Cluster needs to have jobs submitted to it, I created a RESTful API with the Flux Python bindings and create a service for the index-0 pod. This is a less important topic because I have something working, but in figuring out how to expose it to the user (on the local machine) I was only able to get it working with port-forward:


I was hoping for something more hardened, like having it done automatically with the service and some ingress. I didn't hit the nail on the head with that one (and likely will come back to it, although the port forward is fine for now!). Anyway, that wasn't my main question to the list - let's get back to that!

Pods with Hostnames via DNS?


I did see this (and was very happy to discover it) but in practice I couldn't get it to work. Specifically this bit:

> When you use an Indexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS.

Is there a more concrete or dummy example I could look at somewhere? To give an explicit example, I'd want a set of N=6 pods in an indexed job to be able to ping one another, no matter which pod I'm on. Here is an example of the /etc/hosts that we populate to achieve that:

# Kubernetes-managed hosts file.
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.17.0.4      flux-sample-0.flux-sample.flux-operator.svc.cluster.local      flux-sample-0
172.17.0.4 flux-sample-0-flux-sample.flux-operator.svc.cluster.local flux-sample-0
172.17.0.3 flux-sample-1-flux-sample.flux-operator.svc.cluster.local flux-sample-1
172.17.0.6 flux-sample-2-flux-sample.flux-operator.svc.cluster.local flux-sample-2
172.17.0.5 flux-sample-3-flux-sample.flux-operator.svc.cluster.local flux-sample-3

I know it's not best practice to manually tweak this file, but I tried different combinations of DNSPolicy, HostNetwork, Hostname and Subdomain, and SetHostnameAsFQDN (fully qualified domain name), and (out of the box) the nodes couldn't see one another without the hack above. I think what would be nice (is it possible?) would be to just ask for the pods to be networked (and it doesn't really matter how that is done), so you could essentially do:

Spec: batchv1.JobSpec{
	...
	Template: corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			// This field does not exist - I made it up :)
			InterPodNetwork: true,
		},
	},
},
}

And then I'd apply the job, and *boum* I could shell into a node and interact with the others - no extra work or thinking needed!

# Shell into index-0
$ kubectl exec --stdin --tty -n flux-operator flux-sample-0-p8tkr -- bash

# And boum! Networking!
$ ping flux-sample-1
PING flux-sample-1-flux-sample.flux-operator.svc.cluster.local (172.17.0.3) 56(84) bytes of data.
64 bytes from flux-sample-1-flux-sample.flux-operator.svc.cluster.local (172.17.0.3): icmp_seq=1 ttl=64 time=0.175 ms
...

And importantly, this would need to scale (so my doopy update of /etc/hosts likely will not). So that's a high-level "ideal" of what I think we want - right now we have to do what the mpi-operator does and dynamically get the IP addresses for the pods and then have some kind of shell script to write them where they are needed. Maybe there is a simpler, more user-friendly way? Some kind of magic that can happen on the backend to help this little set of pods? Or at least a good example out in the wild of accomplishing the inter-pod networking without a custom hack?
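For context, a rough Go sketch of that discovery step (not the Flux Operator's or mpi-operator's actual code): list the Job's pods through the API and collect their IPs so a script can write them wherever they are needed.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "job-name" is the label the Job controller adds to its pods; the
	// namespace and job name here match the examples earlier in this thread.
	pods, err := clientset.CoreV1().Pods("flux-operator").List(context.Background(),
		metav1.ListOptions{LabelSelector: "job-name=flux-sample"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		// Each line could be appended to /etc/hosts or an MPI-style hostfile.
		fmt.Printf("%s %s\n", pod.Status.PodIP, pod.Name)
	}
}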

 Thank you!

Best,

Vanessa

Aldo Culquicondor

Nov 16, 2022, 8:50:53 AM
to v, Tim Hockin, Sergey Kanzhelev, wg-b...@kubernetes.io

> Inter-Pod Networking within the Job
>
> > And finally, more plug and play ability to have a networked set of pods in the Mini Cluster.
>
> You noted that the v2 mpi-operator has a headless service for a set of pods, but this isn't exactly what we want. We want the pods to be networked together and all see one another, not just to be able to interact with a headless service. [...] I was only able to get it working with port-forward.

I wrote significant parts of the v2 mpi-operator :)
The discovery script is a particular requirement for elastic Horovod, which allows it to know when workers are added or removed at runtime. This script is not necessary when running traditional MPI jobs where the number of workers is known at startup. Still, we have to tell MPI what the hostnames are, but that is done without listing.
In both cases (static and elastic), the launcher and workers communicate via hostname, with the form jobname-worker-0.service-name.namespace.svc (I think in some setups it also works with just jobname-worker-0.service-name). The mapping to IPs happens through the cluster's DNS, configured by the Service.
Unfortunately, we didn't add a sample to the documentation showing how to use an Indexed Job in combination with a headless Service, but it's roughly what we did in the v2 mpi-operator:
Set up a headless Service that matches the pods of the Job. In the Job's pod template, set the subdomain to match the Service's name. And that's it: a hostname would be something like indexedjobname-0.servicename.namespace.svc.
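If it helps debugging, here is a tiny Go sketch you could run from inside one of the pods to check that the names resolve (the hostnames below follow the flux-sample example from earlier in the thread and are assumptions):

package main

import (
	"fmt"
	"net"
)

func main() {
	// From inside any pod of the Job, the lead broker should resolve via the
	// headless Service; the short form relies on the search domains the
	// kubelet writes into the pod's /etc/resolv.conf.
	for _, host := range []string{
		"flux-sample-0.flux-sample",
		"flux-sample-0.flux-sample.flux-operator.svc.cluster.local",
	} {
		addrs, err := net.LookupHost(host)
		if err != nil {
			fmt.Println(host, "-> lookup failed:", err)
			continue
		}
		fmt.Println(host, "->", addrs)
	}
}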
Maybe something is misconfigured in your kubelet or DNS setup? The kubelet is supposed to add search options to each pod's /etc/resolv.conf: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#namespaces-of-services. Also worth looking at the DNS policy https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy

If you manage to get it working and have some time, we would welcome your contribution to github.com/kubernetes/website to add a tutorial :D

I hope this helps

Abdullah Gharaibeh

Nov 16, 2022, 11:03:14 AM
to Aldo Culquicondor, v, Tim Hockin, Sergey Kanzhelev, wg-b...@kubernetes.io
We have an example in the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job#story-2. As Aldo mentioned, it would be great if you could update the Job docs to include the example.


v

Nov 16, 2022, 2:42:32 PM
to Tim Hockin, Aldo Culquicondor, Sergey Kanzhelev, wg-b...@kubernetes.io
Working on feedback from Aldo, but wanted to give a quick response!

> I don't understand some of what you are saying here - it seems to me
> like some of it is supposed to work already, and if it is not working
> it indicates some other problem I don't know. More below.

This was my intuition as well. When I tried creating a service with ingress, it appeared to work but the exposed page was blank. The port-forward command worked.
 
On Tue, Nov 15, 2022 at 5:17 PM v <vso...@gmail.com> wrote:

What does "be networked together" mean?

I can ping a pod from another pod (and get a successful response)
 
> We have API affordances to make this happen already. What you are
> describing SHOULD happen as a result of using those (well, it would be
> DNS not /etc/hosts, but same idea).


Something Aldo said made me step back:

> The mapping to IPs happens through the cluster's DNS, configured by the Service.

Maybe the problem was using minikube without enabling DNS? I apologize for my ignorance on all fronts - I'm totally new to this development space! Do others use Minikube (or kind), or have access to actual clusters?
 
> I can't quite understand what "couldn't see one another" means. Does
> that mean you can't resolve meaningful names or that traffic doesn't
> actually flow?

Just the pings mostly :)

Okay back to trying random things - hopefully if I don't give up I'll stumble on this working!

- V

v

Nov 16, 2022, 2:52:55 PM
to Tim Hockin, Aldo Culquicondor, Sergey Kanzhelev, wg-b...@kubernetes.io
Okay, time to 😆😭 - I enabled the plugins for MiniKube, removed the hackery of /etc/hosts and it just worked! Below, pinging host-0 (where I'm sitting) and host-1 (a different pod, not in /etc/hosts)

[screenshot: successful ping output for host-0 and host-1]

Oh man this is exciting!!! I just need to update the broker config for Flux to include that full hostname. And I think this means I can gut out a lot of code and have a really nice thing working! And I do think it was MiniKube not having some functionality for DNS, because this didn't work before.

Thank you!!! ❤🦄😎

I'll report back if I run into any other issues. And Aldo, if you wrote the mpi-operator, the code is really beautiful.

- V

v

Nov 16, 2022, 2:55:41 PM
to Aldo Culquicondor, Tim Hockin, Kevin Hannon, Sergey Kanzhelev, wg-b...@kubernetes.io
Thanks for the speedy feedback, everyone! I'm hardening up our documentation today (there is a lot of it and I want it organized and nicely rendered) and I should be able to test out some of this advice as soon as I finish that. I will follow up with anything that I learn. Until then... the Flux Operator gopher! 😎

[image: flux-operator-square.png - gopher for your pleasure!]


Tim Hockin

Nov 16, 2022, 2:55:41 PM
to v, Aldo Culquicondor, Sergey Kanzhelev, wg-b...@kubernetes.io
I don't understand some of what you are saying here - it seems to me
like some of it is supposed to work already, and if it is not working
it indicates some other problem I don't know. More below.

On Tue, Nov 15, 2022 at 5:17 PM v <vso...@gmail.com> wrote:

> > And finally, more plug and play ability to have a networked set of pods in the Mini Cluster.
>
> You noted that the v2 mpi-operator has a headless service for a set of pods, but this isn't exactly what we want. We want the pods to be networked together and all see one another, not just to be able to interact with a headless service.

What does "be networked together" mean?

> > When you use an Indexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS.
>
> Is there a more concrete or dummy example I could look at somewhere? To give an explicit example, I'd want to be able to have a set of N=6 pods in an indexed job to be able to ping one another, no matter what pod I'm on. Here is an example of the /etc/hosts that we populate to achieve that:
>
> [... /etc/hosts example quoted from the earlier message ...]

We have API affordances to make this happen already. What you are
describing SHOULD happen as a result of using those (well, it would be
DNS not /etc/hosts, but same idea).

> I know it's not best practice to manually tweak this file, but I tried different combinations of DNSPolicy, HostNetwork, Hostname and Subdomain, and SetHostnameAsFQDN (fully qualified domain name), and (out of the box) the nodes couldn't see one another without the hack above. I think what would be nice (is it possible?) would be to just ask for the pods to be networked (and it doesn't really matter how that is done), so you could essentially do:

I can't quite understand what "couldn't see one another" means. Does
that mean you can't resolve meaningful names or that traffic doesn't
actually flow?

Tim

Tim Hockin

Nov 16, 2022, 2:55:41 PM
to v, Aldo Culquicondor, Sergey Kanzhelev, wg-b...@kubernetes.io
On Wed, Nov 16, 2022 at 11:42 AM v <vso...@gmail.com> wrote:
>
> Working on feedback from Aldo, but wanted to give a quick response!
>
> I don't understand some of what you are saying here - it seems to me
>>
>> like some of it is supposed to work already, and if it is not working
>> it indicates some other problem I don't know. More below.
>>
> This was my intuition as well. When I tried creating a service with ingress, it appeared to work but the exposed page was blank. The port-forward command worked.
>
>>
>> On Tue, Nov 15, 2022 at 5:17 PM v <vso...@gmail.com> wrote:
>>
>> What does "be networked together" mean?
>>
> I can ping a pod from another pod (and get a successful response)

The Kubernetes network model FUNDAMENTALLY assumes that pods can
communicate with other pods, across nodes. If that isn't the case,
all bets are off.

https://kubernetes.io/docs/concepts/cluster-administration/networking/

DNS has a role in this, but it's a discovery mechanism, not a
correctness mechanism. Assuming there's not some other network policy
_blocking_ traffic, any Pod should be able to ping (or connect by
TCP/UDP) any other Pod.

v

Nov 16, 2022, 5:54:31 PM
to Tim Hockin, Aldo Culquicondor, Sergey Kanzhelev, wg-b...@kubernetes.io
Yeah! I think this was me not knowing I needed those MiniKube plugins; we can reduce that to a #vanessaproblem. Installing those plugins made the pod networking work (as expected)!

This is super cool - with the refactor, the startup time for our Mini Cluster is now under 20 seconds (down from ~80) and we are in business! This is a small LAMMPS job (previously run with MPI, here with Flux Framework):

[screenshot: LAMMPS job output running under Flux]


I'm going to do some tweaks to our web interface and client, but after that I should be able to get into containerizing a bunch of HPC workloads and better testing how this setup works (note this is Indexed Jobs with a service).
On a high level, I want us folks over in "HPC land" to better collaborate with cloud native technologies (my background is containers so I love that space), and I think this kind of extension (HPC oriented jobs mapped into Kubernetes) would be so great!
I'll ping y'all for more discussion after a bit more learning and development.

Aldo Culquicondor

Nov 17, 2022, 8:24:28 AM
to v, Tim Hockin, Sergey Kanzhelev, wg-b...@kubernetes.io
> with the refactor the startup time for our Mini Cluster is now under 20 seconds (down from ~80) and we are in business!

That's consistent with the difference between mpi-operator v1 and v2 :D

> On a high level, I want us folks over in "HPC land" to better collaborate with cloud native technologies (my background is containers so I love that space), and I think this kind of extension (HPC oriented jobs mapped into Kubernetes) would be so great!
> I'll ping y'all for more discussion after a bit more learning and development.

Looking forward to collaborating!

Aldo
