TensorFlow GPU-enabled instance on AWS: installing and configuring step by step.


Evgeny Shaliov

Nov 18, 2015, 1:12:39 PM
to Discuss



  1. Create an instance on AWS

  1. Create the AWS instance from “Amazon Linux AMI with NVIDIA GRID GPU Driver”

  2. I chose a g2.2xlarge instance with 16 GB of SSD storage (8 GB may not be enough).

  3. Configure the security group and generate (or reuse) a key pair for access to the instance. A sample AWS CLI launch command is sketched right after this list.
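
If you prefer the AWS CLI to the web console, launching the instance looks roughly like the sketch below. This is an illustration, not part of the original walkthrough; the AMI ID, key pair name, and security group ID are placeholders you must replace with your own values.

# ami-xxxxxxxx is the region-specific "Amazon Linux AMI with NVIDIA GRID GPU Driver";
# my-key-pair and sg-xxxxxxxx are your key pair and a security group allowing inbound SSH (port 22)
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type g2.2xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxx \
    --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":16}}]'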


  2. Configure environment


1. Log in to the remote instance using SSH (default username: ec2-user).
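
For example (the key file and the instance's public DNS name below are placeholders to substitute):

ssh -i my-key-pair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com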

2. Check that Python 2.7, pip, and CUDA are available on the instance (all of them come pre-installed):

   - the AMI contains Python 2.7.10, which TensorFlow requires (python --version)

   - the AMI contains pip 6.1.1 (pip --version)

   - the AMI contains CUDA 6.5.12 (nvcc --version)

3. Install the CUDA Toolkit 7.0 (7.5 does not work with this setup). The toolkit is available at https://developer.nvidia.com/cuda-toolkit-archive

wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run

sudo sh cuda_7.0.28_linux.run (follow the prompts; the answers used are listed below)


Do you accept the previously read EULA? accept

You are attempting to install on an unsupported configuration. Do you wish to continue? y

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 346.46? n

Do you want to install the OpenGL libraries? ((y)es/(n)o/(q)uit) n

Install the CUDA 7.0 Toolkit? y

Enter Toolkit Location [ default is /usr/local/cuda-7.0 ]:

Do you want to install a symbolic link at /usr/local/cuda? y

Install the CUDA 7.0 Samples? n
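
To confirm the new toolkit is in place, a quick sanity check (not part of the original steps):

/usr/local/cuda-7.0/bin/nvcc --version    # should report release 7.0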


4. Download cuDNN

The current cuDNN release from https://developer.nvidia.com/rdp/cudnn-download does not work in this environment!

Download cuDNN from https://developer.nvidia.com/rdp/cudnn-archive instead (it requires registration; follow the instructions at https://developer.nvidia.com/cuDNN; approval can take 1-2 US business days).
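
Since the cuDNN download requires a browser login, one way to get the archive onto the instance is to download it locally and copy it up with scp (the key file and host name below are placeholders):

scp -i my-key-pair.pem cudnn-6.5-linux-x64-v2.tgz ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/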


5. Install cuDNN v2 (important: it must be the 6.5 v2 release).


tar -zxf cudnn-6.5-linux-x64-v2.tgz

cd cudnn-6.5-linux-x64-v2

sudo cp -R lib* /usr/local/cuda/lib64/

sudo cp cudnn.h /usr/local/cuda/include/


6. Add the following environment variables to ~/.bashrc:


export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"

export CUDA_HOME=/usr/local/cuda
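
Then reload the file so the variables take effect in the current shell (a small addition to the original steps):

source ~/.bashrc
echo $CUDA_HOME    # should print /usr/local/cuda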


7. Install TensorFlow


sudo pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.5.0-cp27-none-linux_x86_64.whl


Expected result

Installing collected packages: six, numpy, tensorflow

 Found existing installation: six 1.8.0

   Uninstalling six-1.8.0:

     Successfully uninstalled six-1.8.0

 Running setup.py install for numpy

Successfully installed numpy-1.10.1 six-1.10.0 tensorflow-0.5.0


8. Check that the configured environment is correct. Open a Python shell:


$ python

>>> import tensorflow as tf

>>> hello = tf.constant('Hello, TensorFlow!')

>>> sess = tf.Session()

>>> print sess.run(hello)

Hello, TensorFlow!

>>> a = tf.constant(10)

>>> b = tf.constant(32)

>>> print sess.run(a+b)

42

>>>


  3. Run a sample


1. Install Git


sudo yum install git -y


2. Clone the project


git clone --recurse-submodules https://github.com/tensorflow/tensorflow


3. Run the TensorFlow MNIST convolutional model


python tensorflow/tensorflow/models/image/mnist/convolutional.py


Initialized!

Epoch 0.00

Minibatch loss: 12.054, learning rate: 0.010000

Minibatch error: 90.6%

Validation error: 84.6%

Epoch 0.12

Minibatch loss: 3.285, learning rate: 0.010000

Minibatch error: 6.2%

Validation error: 7.0%

Epoch 0.23

Minibatch loss: 3.473, learning rate: 0.010000

Minibatch error: 10.9%

Validation error: 3.7%

Epoch 0.35

Minibatch loss: 3.221, learning rate: 0.010000

Minibatch error: 4.7%

Validation error: 3.2%

….


Tim Shephard

Nov 18, 2015, 1:20:50 PM
to Evgeny Shaliov, Discuss
Very cool! Thanks for this. I am hoping Amazon upgrades their GPUs or lowers their costs to be more competitive with homegrown systems. Maintaining servers is such a pain, and AWS does such a terrific job.

eva...@gmail.com

Nov 24, 2015, 4:42:22 PM
to Discuss
Thanks for the writeup! I'm seeing the following output when running the MNIST example, which does not seem to be present in your expected output. Any thoughts on what could be different?

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:611] Ignoring gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

Denny Britz

Nov 25, 2015, 12:35:15 AM
to Discuss, eva...@gmail.com
The AWS GPU only supports CUDA compute capability 3.0; TensorFlow by default requires 3.5 or higher. You need to compile TensorFlow from source and specify 3.0 as the compute capability to run it on AWS. There's a related discussion at https://github.com/tensorflow/tensorflow/issues/25

Here are the commands I used on a plain Ubuntu AMI: https://gist.github.com/dennybritz/8c2ca115b72ea98e5192
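
For reference, a rough sketch of what the source build looks like. This is a condensed illustration, not the gist itself: depending on the TensorFlow commit, configure may only accept 3.0 when run with the TF_UNOFFICIAL_SETTING flag, or the compute capability may instead have to be edited in the CUDA crosstool files, so treat the gist above as the authoritative steps.

git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
# answer the configure prompts; enter 3.0 when asked for the Cuda compute capability
TF_UNOFFICIAL_SETTING=1 ./configure
# build the GPU-enabled pip package and install it
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install /tmp/tensorflow_pkg/tensorflow-*.whl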

Evgeny Shaliov

Nov 25, 2015, 7:33:13 AM
to Discuss, eva...@gmail.com
The GPU's compute capability is probably 3.0. Could you check it?
Please look at steps 2.2 and 2.3 above.
Could you provide more details on what you ran?
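
One quick way to check which GPU (and therefore which compute capability) the instance has; the GRID K520 reported in the log above is compute capability 3.0:

nvidia-smi -L    # e.g. "GPU 0: GRID K520"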

eva...@gmail.com

Nov 25, 2015, 11:28:38 AM
to Discuss, eva...@gmail.com
Thanks, after following the instructions I get a compile error. From compiling tensorflow/core/kernels/bias_op_gpu.cu.cc:
tensorflow/core/kernels/bias_op_gpu.cu.cc(40): error: identifier "__ldg" is undefined

Googling a bit, it seems that __ldg is only available with compute capability 3.5 and above, yet I've specified 3.0 in the config.
I'm going to try an older commit of tensorflow, but was wondering if you encountered this error?

eva...@gmail.com

Nov 25, 2015, 12:30:12 PM
to Discuss, eva...@gmail.com
I was able to get everything to work from commit d50565b35e886e7c3a201ea2f088790ed4b28de4 following the instructions at https://gist.github.com/dennybritz/8c2ca115b72ea98e5192

I added one line to the gist to check out the correct commit; here is the forked version:
https://gist.github.com/evahlis/0761ed5457daeffcb9f3
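
For clarity, checking out that commit inside the cloned repo looks like the following (an illustration; the forked gist above has the actual line):

cd tensorflow
git checkout d50565b35e886e7c3a201ea2f088790ed4b28de4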

tedd...@gmail.com

Dec 10, 2015, 1:39:12 AM
to Discuss, eva...@gmail.com
Make sure to use Bazel 0.1.1! I had trouble until I switched to that earlier version.
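
A quick way to check which version is installed (older releases can be found on the Bazel releases page, https://github.com/bazelbuild/bazel/releases):

bazel version    # should report 0.1.1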