singularity install on RHEL7 + GPU


Valentin Kozlov

Feb 24, 2018, 7:28:23 PM
to singularity
Hi all,

I am experimenting a bit with Singularity and trying to install it on AWS, using a RHEL7 AMI (ami-c90195b0).
I first installed the NVIDIA stack from the rpm downloaded from NVIDIA, cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64. I then cloned Singularity from git, compiled it, and installed it (though mksquashfs was not installed). nvidia-smi reports version 387.26.

However, when I run the following as an unprivileged user: singularity shell --nv docker://tensorflow/tensorflow:latest-gpu

I get the following error messages:
~~~~~~~~~~
failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-20-167.eu-west-1.compute.internal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-20-167.eu-west-1.compute.internal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
~~~~~~~~~~
I can still run nvidia-smi inside the container, and it produces the correct output.

Funnily enough, if I install docker-ce and nvidia-docker, I can run the same container in nvidia-docker, BUT then I can also run my command above without any error message. This seems to be related to nvidia-docker loading additional kernel modules.

Any idea how to avoid the error without installing nvidia-docker?

Best,
Valentin

Azat Khuziyakhmetov

Feb 26, 2018, 10:06:19 AM
to singularity
Hi Valentin,

Try using Singularity's --nv flag. It also binds the NVIDIA drivers from the host machine.

If anyone knows how to properly bind the drivers from the host into the container manually (without installing them via apt), please reply too :) Thank you,

Best regards,
Azat

Valentin Kozlov

Feb 26, 2018, 10:14:16 AM
to singularity
Hi Azat,

Thank you for your reply, but the "--nv" flag is exactly what I am using:

> under unprivileged user: singularity shell --nv docker://tensorflow/tensorflow:latest-gpu

Inside the container I also get the correct response when I invoke "nvidia-smi", but TensorFlow does not want to start :-(

Best,
Valentin

Azat Khuziyakhmetov

Feb 26, 2018, 10:27:36 AM
to singu...@lbl.gov
Hi Valentin,

> thank you for your reply, but the flag "--nv" is exactly what I do:

Oops, sorry, I hadn't noticed. I had some problems with the nvidia/cuda images for nvidia-docker (I was using the Ubuntu version); maybe the same applies to the tensorflow image.

The problem was that the CUDA libraries were located in a different directory. Within the image they were at /usr/local/cuda-9.0/lib64, but the LD_LIBRARY_PATH environment variable was set to /usr/local/nvidia, which is created only by nvidia-docker. So try to find the CUDA location in the container and append that path to LD_LIBRARY_PATH.
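
As a rough illustration of the suggestion above, the following Python sketch looks for directories that contain CUDA runtime libraries and builds an extended LD_LIBRARY_PATH value. The helper names (`find_cuda_lib_dirs`, `append_ld_library_path`), the search root `/usr/local`, and the `libcudart.so*` pattern are assumptions made for illustration; they are not taken from the tensorflow image, so adjust them to whatever the container actually ships.

```python
import glob
import os


def find_cuda_lib_dirs(root="/usr/local", pattern="libcudart.so*"):
    """Return sorted directories of the form <root>/cuda*/lib64 that hold CUDA runtime libraries.

    The root and pattern are assumptions; other images may use a different layout.
    """
    hits = glob.glob(os.path.join(root, "cuda*", "lib64", pattern))
    return sorted({os.path.dirname(h) for h in hits})


def append_ld_library_path(current, new_dirs):
    """Append each directory to a colon-separated path string, skipping duplicates."""
    parts = [p for p in current.split(":") if p]
    for d in new_dirs:
        if d not in parts:
            parts.append(d)
    return ":".join(parts)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # Print an export line, because the dynamic loader reads the variable
    # at process start; it has to be set before TensorFlow is launched.
    print("export LD_LIBRARY_PATH=" + append_ld_library_path(current, find_cuda_lib_dirs()))
```

Run inside the container, the script prints an `export` line that can be pasted into the shell before starting Python. If no matching directory is found, the printed value is just the current LD_LIBRARY_PATH.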

Best regards,
Azat

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.

David Godlove

Feb 26, 2018, 12:53:18 PM
to singu...@lbl.gov
Hello Valentin,

Can you try with debug messaging enabled and provide output please?  Thanks!

Dave


Jared David Baker

Feb 27, 2018, 12:55:35 AM
to singu...@lbl.gov

Hello Valentin,

 

I've been playing around with TensorFlow 0.12 in a CentOS 7 image (and others). I issued the command `nvidia-cuda-mps-server` before starting the Singularity image (i.e., on the host), and that seems to let TensorFlow work correctly. I'm not completely sure why just yet. This is with CUDA 8 and cuDNN 6. I'm still looking into this issue, but it might get you further along.

 

jared


Valentin Kozlov

Feb 28, 2018, 2:55:47 AM
to singularity
Hello Jared, all,

Thank you, it worked! I compared the environment settings before and after invoking 'nvidia-cuda-mps-server' and see no difference (see the attached aws-rhel7-lsmod+env.tar.gz). However, when I compare the loaded modules (same tar.gz file; compare _justlogged.out and -after-nv-mps_under_tuser.out), I see that after executing 'nvidia-cuda-mps-server' as an _unprivileged user_, 'nvidia_uvm' is added to the list of loaded modules.
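
The before/after module comparison described above can be sketched in Python as follows. It parses two `lsmod` listings and reports the modules that appear only in the second one. The function names and the sample listings are hypothetical (they are not the contents of the attached files); only the usual `lsmod` column layout of module name, size, and use count is assumed.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names in `lsmod` output (first column, header skipped)."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def newly_loaded(before_text, after_text):
    """Return the sorted names of modules present in the second listing but not the first."""
    return sorted(loaded_modules(after_text) - loaded_modules(before_text))


# Hypothetical sample listings; sizes and use counts are made up for illustration.
BEFORE = """Module                  Size  Used by
nvidia_drm             1000  0
nvidia_modeset         2000  1 nvidia_drm
nvidia                 3000  1 nvidia_modeset
"""
AFTER = BEFORE + "nvidia_uvm             4000  0\n"

print(newly_loaded(BEFORE, AFTER))  # prints ['nvidia_uvm']
```

On a real host, the two strings would be the saved output of `lsmod` captured before and after running 'nvidia-cuda-mps-server'.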

@David, I also ran Singularity in debug mode for both cases (before and after issuing 'nvidia-cuda-mps-server'); see aws-rhel7-singularity-debug.tar.gz. The files differ in size, but I could not find what the difference is.

To me this looks more like an NVIDIA problem, or perhaps AWS?

Cheers,
Valentin
aws-rhel7-lsmod+env.tar.gz
aws-rhel7-singularity-debug.tar.gz