Re: [jupyter] Kubernetes NVIDIA GPU/extraVolumeMount issues


Chia-liang Kao

Sep 13, 2018, 1:05:39 PM
to jup...@googlegroups.com
Hi,

1. For the user home PVC, make sure you have the correct fsGid configured. If you use a docker-stacks (jupyter/*) based notebook image, its start script should also try to chown the user home directory properly before switching to the jovyan user.

2. Is your single-user image built with the tensorflow-gpu package or plain tensorflow? Beware that conda can pull in the non-GPU version from mixed channels even if you specifically install tensorflow-gpu.

3. limit: 0 does not take the GPUs away; you need to set NVIDIA_VISIBLE_DEVICES=none as an extra environment variable in this case (a sketch covering this and point 1 follows below).
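
For illustration, points 1 and 3 together might look like this in values.yaml. This is only a sketch; check your chart version for the exact singleuser.fsGid and singleuser.extraEnv keys:

    singleuser:
      fsGid: 100                          # gid that should own the home volume (100 = users in docker-stacks)
      extraEnv:
        NVIDIA_VISIBLE_DEVICES: "none"    # hide all GPUs from the container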

Best,
clkao


On Thursday, September 13, 2018 at 6:53 PM, Benedikt Bäumle <benedikt...@gmail.com> wrote:
Hey guys,

I am currently setting up a bare-metal single-node Kubernetes cluster plus JupyterHub to control resources for our users. I use Helm to deploy JupyterHub with a custom single-user notebook image for deep learning.

The idea is to set up the hub to have better control over NVIDIA GPUs on the server.

I am struggling with a few things that I can't figure out how to do, or whether they are even possible:

1. I mount the user's home directory into the notebook container (in our case /home/dbvis/) in the Helm chart's values.yaml:

    extraVolumes:
        - name: home
          hostPath:
            path: /home/{username}
    extraVolumeMounts:
        - name: home
          mountPath: /home/dbvis/data

It is indeed mounted like this, but with root:root ownership, so I can't add, remove, or change anything inside the container at /home/dbvis/data. What I have tried:

- I tried to change the ownership in the Dockerfile by running 'chown -R dbvis:dbvis /home/dbvis/' at the end as the root user
- I tried the following postStart hook in values.yaml:

    lifecycleHooks:
      postStart:
        exec:
          command: ["chown", "-R", "dbvis:dbvis", "/home/dbvis/data"]

Neither worked. As the storage class, I set up Rook with rook-ceph-block storage.
Any ideas?
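
One route I have not tried yet is an init container that runs as root and fixes the ownership before the notebook starts. A rough sketch only, assuming init containers can be passed through to the spawner (KubeSpawner has an init_containers trait); the container name and the uid/gid here are my guesses:

    singleuser:
      initContainers:
        - name: fix-home-perms            # hypothetical name
          image: busybox
          securityContext:
            runAsUser: 0                  # run as root so chown is permitted
          command: ["sh", "-c", "chown -R 1000:100 /home/dbvis/data"]
          volumeMounts:
            - name: home                  # the hostPath volume from above
              mountPath: /home/dbvis/data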


2. We have several NVIDIA GPUs, and I would like to control them and set limits for the Jupyter single-user notebooks. I set up the NVIDIA device plugin ( https://github.com/NVIDIA/k8s-device-plugin ).
When I use 'kubectl describe node' I find the GPU listed as a resource:

Allocatable:
 cpu:                16
 ephemeral-storage:  189274027310
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             98770548Ki
 nvidia.com/gpu:     1
 pods:               110
...
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests     Limits
  --------        --------     ------
  cpu             2250m (14%)  4100m (25%)
  memory          2238Mi (2%)  11146362880 (11%)
  nvidia.com/gpu  0            0
Events:           <none>

Inside the Jupyter single-user notebooks I can see the GPU when executing 'nvidia-smi'.
But if I ask TensorFlow to list the devices with the following code:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

I just get the CPU device:

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 232115754901553261]

Any idea what I am doing wrong? 
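
In case it narrows things down, here is a small diagnostic that could be run in the same kernel; a sketch assuming the TF 1.x API from the snippet above:

    import tensorflow as tf

    # True only if the GPU build of TF is installed *and* it can
    # initialize a CUDA device with a supported compute capability.
    print(tf.test.is_gpu_available())

    # log_device_placement prints device assignments and surfaces
    # CUDA/driver errors on stderr when the session starts.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))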

Further, I would like to limit the number of GPUs (this is just a test environment with one GPU; we have more). I tried the following, which doesn't seem to have any effect:

- Add the following config in values.yaml, in every possible combination:

  extraConfig: |
     c.Spawner.notebook_dir = '/home/dbvis'
     c.Spawner.extra_resource_limits = {'nvidia.com/gpu': '0'}
     c.Spawner.extra_resource_guarantees = {'nvidia.com/gpu': '0'}
     c.Spawner.args = ['--device=/dev/nvidiactl', '--device=/dev/nvidia-uvm', '--device=/dev/nvidia-uvm-tools', '--device=/dev/nvidia0']

- Add the GPU to the resources in the singleuser configuration in values.yaml:

singleuser:
  image:
    name: benne4444/dbvis-singleuser
    tag: test3
    limit: 1
    guarantee: 1

Is what I am trying even possible right now?
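
For reference, here is how I understand the GPU limit would be spelled at the chart level; a sketch only, and I am not certain this chart snapshot exposes singleuser.extraResource (the extraConfig route via KubeSpawner's extra_resource_limits trait should be the equivalent):

    singleuser:
      extraResource:
        limits:
          nvidia.com/gpu: "1"
        guarantees:
          nvidia.com/gpu: "1"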

Further information:

I set up a server running 

- Ubuntu 18.04.1 LTS
- nvidia-docker
- JupyterHub Helm chart version 0.8-ea0cf9a

I added the complete values.yaml.

If you need additional information, please let me know. Any help is much appreciated.

Thank you,
Benedikt





Benedikt Bäumle

Sep 26, 2018, 1:20:34 PM
to Project Jupyter
Fixed. Can be deleted.

On Thursday, September 13, 2018 at 7:05:39 PM UTC+2, Chia-liang Kao wrote:
Hi,

1. For the user home PVC, make sure you have the correct fsGid configured. If you use a docker-stacks (jupyter/*) based notebook image, its start script should also try to chown the user home directory properly before switching to the jovyan user.

2. Is your single-user image built with the tensorflow-gpu package or plain tensorflow? Beware that conda can pull in the non-GPU version from mixed channels even if you specifically install tensorflow-gpu.

Jupyter Notebook didn't give me any log messages. Looking at the logs in a Python terminal showed me that my test graphics card was not compatible.
 

3. limit: 0 does not take the GPUs away; you need to set NVIDIA_VISIBLE_DEVICES=none as an extra environment variable in this case.

The incompatibility of my graphics card was also the problem here.