How can I install only the tfjob, mpijob and pytorch operators?


as...@lyft.com

Nov 18, 2020, 12:06:12 PM
to kubeflow-discuss
Hello Experts - I would like to spawn distributed training using the mpijob and tfjob operators. However, I do not need to install the entire Kubeflow at the moment due to other restrictions. All the docs from the individual operator modules lead to the instructions for installing the entire Kubeflow.

Is there a way to install only the required tfjob, pytorchjob and mpijob operators?

Thanks
Anindya

Niklas Hansson

Nov 18, 2020, 5:13:28 PM
to as...@lyft.com, kubeflow-discuss
Hi,

You can deploy the individual parts. I don't think it is well documented; I struggled to find it as well. kubeflow/manifests holds the kustomize manifests for deploying the different components. Check out kubeflow/manifests/tf-training; from there you can deploy the CRD and the operator. Let me know if you struggle with it!

/Niklas 


David Aronchick

Nov 18, 2020, 6:28:45 PM
to Niklas Hansson, as...@lyft.com, kubeflow-discuss
Does this documentation help at all? https://www.kubeflow.org/docs/other-guides/kustomize/

After doing the initial configuration, you should be able to drop most manifests...


Rui Vasconcelos

Nov 19, 2020, 7:28:56 AM
to David Aronchick, Niklas Hansson, as...@lyft.com, kubeflow-discuss
Hey Anindya,

Another option to do this would be to `juju deploy <app>` for the Kubeflow applications you want to use - docs.

Cheers,
Rui



--
Rui Vasconcelos
Product Manager - AI/ML
Canonical | Ubuntu

Jeremy Lewi

Nov 19, 2020, 8:13:36 AM
to Rui Vasconcelos, David Aronchick, Niklas Hansson, as...@lyft.com, kubeflow-discuss
Niklas is right on the money.

Our goal is to try to make applications easy to install separately, following the standard practice of providing YAML files that you can kubectl apply.

kubeflow/manifests will be a catalog in which you can find all of the applications. So the applications you probably want are

So as Niklas says the YAML you want is probably below.


If you run into bugs/friction that make it difficult to install individual applications it would be great to open issues against the respective application.

Thanks
J


Alessandro Festa

Nov 19, 2020, 1:22:11 PM
to kubeflow-discuss
One way may be to use K3ai, as documented at https://docs.k3ai.in. We have tfjob, mpiop and pytorchop, as well as katib and kubeflow pipelines, supported as individual components on any Kubernetes cluster (and if you don't have one, we can also install k3s for you).
Version 2, with a full opinionated CLI based on Go, is also available. Contributors, beta testers and community members are super welcome!

Alex

Anindya Saha

Nov 19, 2020, 4:47:42 PM
to Jeremy Lewi, Niklas Hansson, Rui Vasconcelos, David Aronchick, kubeflow-discuss
Thanks All for responding. 

Hi Jeremy, Niklas, David

I installed the whole of KF 1.1 on AWS EKS yesterday. The YAML that was generated is attached. I could not find any reference to tf-job, pytorch-job or mpi-job in that YAML; however, I saw that all the operators got installed by default without me doing anything.

I was hoping to find individual sections for those 3 operators in the generated YAML, keep just those 3, get rid of the other kustomizeConfig sections, and then re-apply so that I do not install everything.

Digging further, I see they are referenced in https://github.com/kubeflow/manifests/blob/master/stacks/kubernetes/kustomization.yaml, and that stack is loaded in kfctl_k8s_istio.v1.1.0.yaml, which is the reason why all the operators got installed by default.

Option 1:
Now, should I just edit https://github.com/kubeflow/manifests/blob/master/stacks/kubernetes/kustomization.yaml to keep only the section below,
# Training Operators
- ../../pytorch-job/pytorch-job-crds/overlays/application
- ../../pytorch-job/pytorch-operator/overlays/application
- ../../tf-training/tf-job-crds/overlays/application
- ../../tf-training/tf-job-operator/overlays/application
- ../../mxnet-job/mxnet-operator/overlays/application
- ../../mpi-job/mpi-operator/overlays/application

or what should I keep and what should I get rid of in https://github.com/kubeflow/manifests/blob/master/stacks/kubernetes/kustomization.yaml? Or do I need to handcraft a new KfDef YAML just for those operators and not use kfctl_k8s_istio.v1.1.0.yaml?

Option 2:
Or should I just run the following for each operator individually, and that's it?
kubectl apply

There is no clear documentation in any of those operator projects on whether this will be a trial-and-error process to figure out other dependencies of the operators, whether they depend on some common manifests, or what a custom KfDef YAML should look like. I am concerned that working out the dependencies by trial and error will lead me down a rabbit hole.

Niklas, I think I will need your help. If you have a YAML to share, could you kindly share it?

Thanks
Anindya
kfctl_k8s_istio.v1.1.0.yaml

niklas.sv...@gmail.com

Nov 19, 2020, 5:12:48 PM
to kubeflow-discuss
The way I have deployed separate components is based on what you describe here as Option 2, which is also what I think Jeremy points you towards.

To deploy the tf-operator I did the following:

1. Clone the kubeflow/manifests repo.
2. Check out the version you are interested in. 1.1.0 should be the latest stable; 1.2.0 has 2 release candidates.
3. Deploy the resources to k8s. Inside manifests/tf-training, run:
    3.1 kubectl apply -k tf-job-crds/overlays/application/
    3.2 kubectl apply -k tf-job-operator/overlays/application

You should now have the tf-operator deployed on your cluster. It might take a couple of minutes to come up. As far as I know it is standalone, so you should be good to go. I have deployed tf-training this way together with Kubeflow Pipelines standalone. The Kubeflow Pipelines manifests in the kubeflow/pipelines repo could give some background. I believe all parts of Kubeflow use Kustomize for deployment, so they should be available in the manifests repo and can be deployed the same way.
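
Since the operator can take a couple of minutes to come up, one option is to let kubectl wait for it rather than polling by hand. The sketch below is only an illustration: it assumes the operator runs as a deployment named tf-job-operator in the kubeflow namespace (adjust both to match your cluster), and the DRY_RUN switch is a convenience invented for this example. By default it only prints the command.

```shell
#!/bin/sh
# Sketch: wait for the tf-operator deployment to become ready.
# NAMESPACE and DEPLOYMENT are assumptions; change them to match your cluster.
NAMESPACE="${NAMESPACE:-kubeflow}"
DEPLOYMENT="${DEPLOYMENT:-tf-job-operator}"
TIMEOUT="${TIMEOUT:-300s}"
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

wait_for_operator() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl -n $NAMESPACE rollout status deployment/$DEPLOYMENT --timeout=$TIMEOUT"
  else
    # Blocks until the rollout finishes or the timeout expires.
    kubectl -n "$NAMESPACE" rollout status "deployment/$DEPLOYMENT" --timeout="$TIMEOUT"
  fi
}

wait_for_operator
```

Run it with DRY_RUN=0 once the printed command looks right for your setup.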

For me there was actually no rabbit hole once I found out about the kubeflow/manifests repo. Let me know if it is still unclear :)


Jeremy Lewi

Nov 19, 2020, 7:21:50 PM
to niklas.sv...@gmail.com, kubeflow-discuss
To add on to what Niklas says: I would recommend using kustomize and avoiding KfDef.

The two approaches you mentioned are IMO mostly equivalent. The difference is whether you start with everything and remove what you don't want (Option 1) or start with nothing and add only what you need (Option 2).

Kustomize allows for inheritance/composition using the resources and bases fields. Here are the relevant docs.

So a stack is just composing other kustomize packages, which are in turn themselves compositions of other packages. If you start with the stack kustomization.yaml, you can navigate via resources and bases to the packages it is composing.

The advantage of Option 1 is that it tells you which version to use. We have multiple versions of each package stored in the manifests repo. If you start with the stack kustomization.yaml file and follow resources/bases, you will be able to identify which version is actually being included in your stack. Whereas if you start with Option 2, it might be harder to identify the correct version you want.
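
To make the composition idea concrete, here is a minimal sketch of a small stack of your own that pulls in only the training operators. The directory name stacks/training-operators is hypothetical (it does not exist in the manifests repo), the checkout directory is an assumption, and the resource paths are the ones quoted earlier in this thread from stacks/kubernetes/kustomization.yaml. The script only writes the file; rendering and applying it is left as a commented-out step.

```shell
#!/bin/sh
# Sketch: write a kustomization.yaml that composes only the training operators.
# STACK_DIR is a hypothetical directory inside a checkout of kubeflow/manifests.
STACK_DIR="${STACK_DIR:-manifests-v1.1.0-branch/stacks/training-operators}"

mkdir -p "$STACK_DIR"
cat > "$STACK_DIR/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# Training operators, using the same relative paths as stacks/kubernetes
- ../../pytorch-job/pytorch-job-crds/overlays/application
- ../../pytorch-job/pytorch-operator/overlays/application
- ../../tf-training/tf-job-crds/overlays/application
- ../../tf-training/tf-job-operator/overlays/application
- ../../mpi-job/mpi-operator/overlays/application
EOF

echo "wrote $STACK_DIR/kustomization.yaml"

# To render and apply it (needs kustomize and kubectl on your PATH):
#   kustomize build "$STACK_DIR" | kubectl apply -f -
```

Because the new directory sits next to stacks/kubernetes, the ../../ paths resolve to the same packages the full stack uses.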

J
 


Anindya Saha

Nov 24, 2020, 3:36:16 PM
to Jeremy Lewi, niklas.sv...@gmail.com, kubeflow-discuss

Hi Jeremy, Niklas,

Thank you folks for your pointers. I was able to deploy the TFJob, PyTorch and MPIJob operators on AWS EKS by following these steps. I am writing them down for the record in case it helps someone in the future.

# clone the kubeflow/manifests repository
git clone -b v1.1.0 https://github.com/kubeflow/manifests manifests-v1.1.0-branch

kubectl create namespace kubeflow

kustomize build manifests-v1.1.0-branch/tf-training/tf-job-crds/overlays/application/ | kubectl apply -f -
kustomize build manifests-v1.1.0-branch/tf-training/tf-job-operator/overlays/application/ | kubectl apply -f -

kustomize build manifests-v1.1.0-branch/pytorch-job/pytorch-job-crds/overlays/application/ | kubectl apply -f -
kustomize build manifests-v1.1.0-branch/pytorch-job/pytorch-operator/overlays/application/ | kubectl apply -f -

kustomize build manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/ | kubectl apply -f -
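
The same steps can also be written as a loop, which keeps the ordering (CRDs before the operators that watch them) in one place and stops at the first failure. This is only a sketch: the DRY_RUN switch is a convenience invented for this example, and by default the script just prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the training-operator overlays in order.
MANIFESTS_DIR="${MANIFESTS_DIR:-manifests-v1.1.0-branch}"
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# CRDs are listed before the operators that depend on them.
OVERLAYS="
tf-training/tf-job-crds/overlays/application
tf-training/tf-job-operator/overlays/application
pytorch-job/pytorch-job-crds/overlays/application
pytorch-job/pytorch-operator/overlays/application
mpi-job/mpi-operator/overlays/application
"

apply_overlays() {
  for overlay in $OVERLAYS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "kustomize build $MANIFESTS_DIR/$overlay | kubectl apply -f -"
    else
      # Stop at the first overlay that fails to apply.
      kustomize build "$MANIFESTS_DIR/$overlay" | kubectl apply -f - || return 1
    fi
  done
}

apply_overlays
```

Set DRY_RUN=0 to actually apply the overlays after creating the kubeflow namespace.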


P.S.

For some reason, kubectl apply -k was failing with unknown field "envs" errors. I believe I have a version mismatch between kubectl and kustomize.

(base) asaha-mbp151:exploration asaha$ kubectl apply -k manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/
error: couldn't make target for ../../base: json: unknown field "envs"



Thanks
Anindya

David Aronchick

Nov 24, 2020, 3:40:54 PM
to Anindya Saha, Jeremy Lewi, niklas.sv...@gmail.com, kubeflow-discuss
Thanks so much! Is there anything we should do upstream to make it easier? Also, any chance this also worked on 1.2?

niklas.sv...@gmail.com

Nov 24, 2020, 4:14:25 PM
to kubeflow-discuss
Great to hear that it worked for you Anindya,  happy to help!

I think the documentation for deploying the different components could be improved. It would probably be good if the individual parts of Kubeflow had some very short docs, since I guess that is where people usually start.

Jeremy Lewi

Nov 27, 2020, 10:19:02 AM
to niklas.sv...@gmail.com, kubeflow-discuss
Thanks for providing the instructions.

I think the issue with "kubectl apply -k" might be that kubectl is using its own built-in, older version of kustomize.
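
A quick way to compare the two tools is to print both versions side by side. The helper below is a small sketch (the function name is made up for this example); it reports a tool as missing instead of failing when it is not on the PATH.

```shell
#!/bin/sh
# Sketch: print the versions of kubectl and standalone kustomize, if installed.
print_tool_version() {
  tool="$1"
  shift
  if command -v "$tool" >/dev/null 2>&1; then
    # Run the tool with the remaining arguments, e.g. "version --client".
    "$tool" "$@"
  else
    echo "$tool: not found on PATH"
  fi
}

print_tool_version kubectl version --client
print_tool_version kustomize version
```

If the standalone kustomize turns out to be much newer than the kubectl client, piping kustomize build into kubectl apply -f - avoids relying on the copy built into kubectl.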

J

Hamed Saljooghinejad

Dec 18, 2020, 6:28:33 AM
to Jeremy Lewi, niklas.sv...@gmail.com, kubeflow-discuss
Hi 

I have tried to follow the instructions and install the tf-operator independently using the manifests. However, applying them fails both ways, through "kubectl apply -k" and through "kustomize build .. | kubectl apply", with this message:
error: unable to recognize "tf-job-crds/overlays/application": no matches for kind "Application" in version "app.k8s.io/v1beta1"

Any suggestions?

Thanks,
Hamed

Hamed Saljooghinejad

Dec 18, 2020, 6:34:59 AM
to Jeremy Lewi, niklas.sv...@gmail.com, kubeflow-discuss
Just to add: it creates the CRDs and other resources but then fails on the Application kind.

tf-training git:(da561762) k apply -k tf-job-crds/overlays/application
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org unchanged
error: unable to recognize "tf-job-crds/overlays/application": no matches for kind "Application" in version "app.k8s.io/v1beta1"

➜  tf-training git:(da561762) k apply -k tf-job-operator/overlays/application
serviceaccount/tf-job-dashboard unchanged
serviceaccount/tf-job-operator unchanged
clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-admin configured
clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-edit unchanged
clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-view unchanged
clusterrole.rbac.authorization.k8s.io/tf-job-operator unchanged
clusterrolebinding.rbac.authorization.k8s.io/tf-job-operator unchanged
service/tf-job-operator unchanged
deployment.apps/tf-job-operator unchanged
error: unable to recognize "tf-job-operator/overlays/application": no matches for kind "Application" in version "app.k8s.io/v1beta1"

Hamed Saljooghinejad

Dec 18, 2020, 10:14:17 AM
to Jeremy Lewi, kubeflow-discuss
OK, I just created the Application CRD from the kfp-tekton manifest here: https://github.com/kubeflow/kfp-tekton/blob/master/install/v0.4.0/kfp-tekton.yaml
and things look fine now.
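
For anyone hitting the same error, a quick check before applying the overlays is whether the Application CRD is already registered. The sketch below assumes the CRD is named applications.app.k8s.io, which matches the kind and API group in the error message; the helper name is made up for this example.

```shell
#!/bin/sh
# Sketch: report whether a CRD is registered in the current cluster.
check_crd() {
  if kubectl get crd "$1" >/dev/null 2>&1; then
    echo "present: $1"
  else
    # Also printed when kubectl is missing or the cluster is unreachable.
    echo "missing: $1"
  fi
}

check_crd applications.app.k8s.io
```

If it reports the CRD as missing, the Application resources in the overlays/application packages will presumably be rejected the way they were above.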

Hamed

Hamed Saljooghinejad

Jan 8, 2021, 1:48:25 PM
to Jeremy Lewi, kubeflow-discuss
Hi

Following the above approach to install xgboost (v1.2.0), I am getting errors in the xgboost operator pod.

k get crds
NAME                                         CREATED AT
applications.app.k8s.io                      2020-12-18T15:00:13Z
xgboostjobs.xgboostjob.kubeflow.org          2021-01-04T11:22:57Z
-------------------
This is the log from the pod: kl xgboost-operator-deployment-59495c9854-zgpgh

kl xgboost-operator-deployment-59644c445-zpfjy
{"level":"info","ts":1609759383.411976,"logger":"entrypoint","msg":"setting up client for manager"}
{"level":"info","ts":1609759383.4121592,"logger":"entrypoint","msg":"setting up manager"}
{"level":"info","ts":1609759384.0154142,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1609759384.015574,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1609759384.0155895,"logger":"entrypoint","msg":"setting up scheme"}
{"level":"info","ts":1609759384.0157025,"logger":"entrypoint","msg":"Setting up controller"}
{"level":"info","ts":1609759384.0157304,"logger":"controller","msg":"Running controller in in-cluster mode"}
{"level":"info","ts":1609759384.0158665,"logger":"controller","msg":"gang scheduling is set: ","gangscheduling":false}
{"level":"info","ts":1609759384.0159233,"logger":"entrypoint","msg":"setting up webhooks"}
{"level":"info","ts":1609759384.01593,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1609759384.0162542,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1609759384.0163705,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"xgboostjob-controller","source":"kind source: /, Kind="}
{"level":"error","ts":1609759386.1145597,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"XGBoostJob.xgboostjob.kubeflow.org","error":"no matches for kind \"XGBoostJob\" in version \"xgboostjob.kubeflow.org/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/za...@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/pkg/mod/sigs.k8s.io/controlle...@v0.4.0/pkg/source/source.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controlle...@v0.4.0/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controlle...@v0.4.0/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElectionRunnables.func1\n\t/go/pkg/mod/sigs.k8s.io/controlle...@v0.4.0/pkg/manager/internal.go:477"}
{"level":"error","ts":1609759386.1146486,"logger":"entrypoint","msg":"unable to run the manager","error":"no matches for kind \"XGBoostJob\" in version \"xgboostjob.kubeflow.org/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/za...@v0.1.1/zapr.go:128\nmain.main\n\t/go/src/github.com/kubeflow/xgboost-operator/cmd/manager/main.go:82\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
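
The log complains about version xgboostjob.kubeflow.org/v1 even though the xgboostjobs CRD is listed, so one thing that might be worth checking is which versions that CRD actually defines. The thread does not confirm this is the cause. The sketch below is a diagnostic only, and the helper name is made up for this example.

```shell
#!/bin/sh
# Sketch: list the versions a CRD defines and check for the one the operator wants.
CRD="${CRD:-xgboostjobs.xgboostjob.kubeflow.org}"
WANTED="${WANTED:-v1}"

# Returns 0 if the first argument appears among the remaining arguments.
has_version() {
  wanted="$1"
  shift
  for v in "$@"; do
    [ "$v" = "$wanted" ] && return 0
  done
  return 1
}

# jsonpath prints the version names separated by spaces.
versions="$(kubectl get crd "$CRD" -o jsonpath='{.spec.versions[*].name}' 2>/dev/null)"
echo "versions defined by $CRD: ${versions:-none found}"

# $versions is intentionally unquoted so each name becomes its own argument.
if has_version "$WANTED" $versions; then
  echo "$WANTED is defined"
else
  echo "$WANTED is not defined; the operator image and the CRD may come from different releases"
fi
```

If v1 is missing from the list, installing the CRD and the operator from the same release of the manifests would be the first thing to try.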

Thanks
Hamed