using CUDA on HPC


Clive Darwell

Mar 4, 2020, 3:53:48 AM
to Selene (sequence-based deep learning package)
I'm trying to get Selene running with CUDA on my institution's Linux HPC.

The following CUDA modules are available on the system:

CUDA/10.0.130  
CUDA/10.1.243-GCC-8.3.0    
CUDA/10.1.243 

However, when I run the simple_train.yml file I get the error: Found no NVIDIA driver on your system

Is it possible to invoke CUDA with Selene on this system?

Many thanks

Clive

Dat Duong

Mar 4, 2020, 4:03:40 AM
I was able to run it by using 

CUDA_VISIBLE_DEVICES=0 python -u ../../../selene_cli.py ./train_online_sampler.yml --lr=0.08

However, even when you don't set CUDA_VISIBLE_DEVICES=0, PyTorch will default to GPU 0. I would suggest that you check whether you can see all visible GPUs by running `nvidia-smi`.
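
For reference, the check Dat describes can be scripted. The following is a minimal sketch, not part of Selene: the function name `gpu_visibility_report` is made up for illustration, and it assumes `nvidia-smi -L` (which lists one GPU per line) is what you want to run once the tool is found on the PATH.

```python
import os
import shutil
import subprocess


def gpu_visibility_report():
    """Collect basic facts about GPU visibility on the current node."""
    report = {
        # None means the variable is unset, so CUDA may use every GPU it finds.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # None means nvidia-smi is not on PATH (common on login/CPU nodes).
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "gpus": [],
    }
    if report["nvidia_smi_path"] is not None:
        # `nvidia-smi -L` prints one line per GPU the driver can see.
        result = subprocess.run(
            [report["nvidia_smi_path"], "-L"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            report["gpus"] = [
                line for line in result.stdout.splitlines() if line.strip()
            ]
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi_path` comes back as `None`, the shell is most likely on a node without the NVIDIA driver installed, which would match the "Found no NVIDIA driver" message.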


Clive Darwell

Mar 4, 2020, 8:05:32 PM
Now I get an error:

RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /opt/conda/conda-bld/pytorch_1533739672741/work/aten/src/THC/THCGeneral.cpp:74

Kathy Chen

Mar 5, 2020, 11:15:37 AM
Hi Clive,

This looks like an issue with your PyTorch installation being incompatible with your current CUDA version. You should check that the cudatoolkit you've installed matches the version of CUDA on your machine.

Thanks!
Kathy
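
To make the mismatch concrete: runtime error 35 is the CUDA runtime's "insufficient driver" error, meaning the installed NVIDIA driver is older than the CUDA runtime that PyTorch was built against requires. Below is a minimal sketch of that comparison. The minimum Linux driver versions in the table are quoted from memory of NVIDIA's CUDA release notes and should be verified there; `driver_supports` is a hypothetical helper, and the driver version string is assumed to be the one printed in the `nvidia-smi` header.

```python
# Minimum Linux driver version per CUDA runtime. ASSUMPTION: values recalled
# from NVIDIA's CUDA release notes; check the official table before relying
# on them. Only the CUDA versions mentioned in this thread are listed.
MIN_LINUX_DRIVER = {
    "9.2": (396, 26),
    "10.0": (410, 48),
    "10.1": (418, 39),
}


def parse_version(text):
    """Turn a version string such as '418.87.00' into (418, 87, 0)."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_runtime):
    """Return True if the driver is new enough for the given CUDA runtime."""
    required = MIN_LINUX_DRIVER[cuda_runtime]
    # Compare only major and minor components of the driver version.
    return parse_version(driver_version)[:2] >= required


if __name__ == "__main__":
    # Example driver strings; substitute the one nvidia-smi reports.
    for driver in ("410.48", "418.87.00"):
        for runtime in ("10.0", "10.1"):
            print(driver, runtime, driver_supports(driver, runtime))
```

Under these assumptions, a node with a 410-series driver could run a PyTorch build for CUDA 10.0 but would raise error 35 with a build for CUDA 10.1.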

Clive Darwell

Mar 5, 2020, 7:39:00 PM
Hi there Kathy

On my system, module avail gives me:

CUDA/10.0.130
CUDA/10.1.243-GCC-8.3.0
CUDA/10.1.243

In Python when I type:

import torch
torch.version.cuda
I get: '10.1'


I also tried with cudatoolkit=9.2

Any thoughts?

Thanks

Clive

Kathy Chen

Mar 5, 2020, 9:58:07 PM
Hmm... not sure. What version of PyTorch are you using? For CUDA versions above 10, I use PyTorch 1.0 and above, and toolkit cuda100. 

Clive Darwell

Mar 5, 2020, 10:23:51 PM
Hi

So, python -c "import torch; print(torch.__version__, torch.version.cuda)"

gives me: 1.4.0 10.1

??

Kathy Chen

Mar 5, 2020, 10:32:40 PM
Did you ever run `nvidia-smi` like Dat mentioned in the initial post? That can give you more info on the CUDA driver version 

Clive Darwell

Mar 5, 2020, 10:33:56 PM
I just get command not found in my Linux terminal!

Kathy Chen

Mar 5, 2020, 10:40:51 PM
And you're SSH'd into the GPU node where you would be running the job? `torch.cuda.is_available()` I'm guessing returns False then?

There might not be any NVIDIA CUDA drivers installed on that machine then...? In which case you'll need to get that installed before being able to run stuff on the GPU. 
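
A quick way to answer both questions at once is to run a small diagnostic on the node where the job will execute. This is only a sketch: `torch_cuda_summary` is a hypothetical helper name, and it uses nothing beyond `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import socket


def torch_cuda_summary():
    """Summarize what PyTorch can see on the node this runs on."""
    summary = {"hostname": socket.gethostname()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        summary["torch_installed"] = False
        return summary

    available = torch.cuda.is_available()
    summary.update(
        {
            "torch_installed": True,
            "torch_version": torch.__version__,
            # CUDA version PyTorch was built against; None for CPU-only builds.
            "built_for_cuda": torch.version.cuda,
            "cuda_available": available,
            "device_count": torch.cuda.device_count() if available else 0,
        }
    )
    return summary


if __name__ == "__main__":
    for key, value in torch_cuda_summary().items():
        print(f"{key}: {value}")
```

Running it on both the login node and the GPU node should show whether `cuda_available` differs between them, which is the case Kathy is asking about.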

Jian Zhou

Mar 5, 2020, 10:43:20 PM
Did you ever load the CUDA module, by the way? Something like `module load CUDA/10.1.243`

Clive Darwell

Mar 5, 2020, 10:44:22 PM
OK let me talk to the IT team here

Thanks for your help

Clive Darwell

Mar 5, 2020, 10:49:35 PM
Hi jzhoup

Yes, I tried that thanks

C

Clive Darwell

Mar 8, 2020, 10:11:39 PM
So, further problems. SSH-ing into the GPU partition initially solved the problem, but later during the run I get a "389404 bus error" as the program attempts to create the test dataset. As I understand it, a bus error occurs when memory can't be accessed.

Any thoughts on this?

Many thanks

Clive

Kathy Chen

Mar 11, 2020, 11:06:24 AM
Maybe you need to allocate more memory? You could also just set `load_test_set` to False for now (see http://selene.flatironinstitute.org/overview/cli.html#general-configurations) to see if it will run in the first place. Can you run on CPU nodes, by the way? Just to be sure that you've been able to run Selene in general and that the main problem is getting it to work on the GPU node?
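
Before requesting more memory, it may help to see what the node actually offers. The sketch below is Linux-only and rests on an assumption that has not been confirmed for this case: that the bus error could be related to a full shared-memory filesystem (`/dev/shm`), which is one common cause of bus errors in multi-process Python jobs. `memory_report` is a hypothetical helper name.

```python
import os
import shutil


def memory_report(shm_path="/dev/shm"):
    """Report physical RAM and shared-memory capacity on a Linux node."""
    gib = 1024 ** 3
    report = {}
    # Total physical RAM of the node. Note this is the hardware total, not
    # the amount the job scheduler has allocated to your job.
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    report["physical_ram_gib"] = page_size * phys_pages / gib
    # Shared memory is a separate, usually smaller, tmpfs mount.
    if os.path.isdir(shm_path):
        usage = shutil.disk_usage(shm_path)
        report["shm_total_gib"] = usage.total / gib
        report["shm_free_gib"] = usage.free / gib
    return report


if __name__ == "__main__":
    for key, value in memory_report().items():
        print(f"{key}: {value:.1f}")
```

If `shm_free_gib` is small compared with the size of the test set, that would be worth raising with the cluster administrators alongside the scheduler's memory limits.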

Clive Darwell

Mar 13, 2020, 3:29:00 AM
Hi

Yes, I can run on CPU (very slowly). I set the memory for the GPU partition to 384 GB and it runs, but I still get the "bus error".

Is the getting started tutorial with "simple_train.yml" quite an extensive analysis? 

Thanks

Clive

Kathy Chen

Mar 16, 2020, 9:37:25 AM
No, it's not an extensive analysis. Unfortunately, it's hard for us to debug an error that is not explicitly Selene-related, since there's so little information to go on. :( Did you try changing the parameters to see if you get the bus error regardless of which training step it's on?

Clive Darwell

Mar 16, 2020, 8:51:18 PM
OK thanks for your help - I will talk to our IT people

C

Kathy Chen

Mar 17, 2020, 3:47:58 PM
Sounds good, happy to help when you have more updates on this!