[Rocks-Discuss] Creating a Rocks-Based GPU Cluster


Jon Forrest

Feb 2, 2011, 5:08:21 PM
to Rocks-discuss
As promised, I've written up my notes on how
to create a hybrid GPU cluster. I haven't tested
this as much as I'd like since I had to do this
quickly on a production cluster so I'd welcome corrections.

-----------------
Creating a Rocks-Based GPU Cluster

Jon Forrest (jlfo...@berkeley.edu)

version 0.1, 2/2/2011


What follows is the way I've created a Rocks-based GPU cluster. I'm not
saying that this is the only way. Although this method works, I'm the
first to admit that there are a few details that I should examine more
deeply. However, since I don't have a test GPU cluster, that will have
to wait. I welcome any comments, questions, or corrections.

My GPU cluster consists of a frontend with a 6-core AMD Opteron 2427,
16GB of RAM, and no GPU cards. There are also 7 compute nodes each with
a 4-core Xeon E5504, 8GB of RAM, and 2 Nvidia GTX460 graphics cards with
1GB of RAM. The frontend and compute nodes are connected via a gigabit
ethernet switch. The cluster is running Rocks 5.3 x86_64 but I believe
this method will also work with newer Rocks releases. This note assumes
you already have the cluster frontend up and running, and all you
need to do is add the compute nodes.

The first thing you need to do is to download the latest version of the
Nvidia CUDA software. As I'm writing this, the latest version is 3.2 so
everything in this note will reflect this. If newer versions of the
Nvidia software are released, any links and filenames in this note will
need to be changed to reflect the new version numbers. With that in
mind, go to

http://developer.nvidia.com/object/cuda_3_2_downloads.html

and download the developer drivers, the CUDA Toolkit for RedHat
Enterprise Linux, and the GPU Computing SDK. At the time I wrote this,
the links for the 64-bit versions of these things are:

http://developer.download.nvidia.com/compute/cuda/3_2_prod/drivers/devdriver_3.2_linux_64_260.19.26.run
http://www.nvidia.com/object/thankyou.html?url=/compute/cuda/3_2_prod/toolkit/cudatoolkit_3.2.16_linux_64_rhel5.5.run

The SDK is architecture-independent but not OS-independent, and is at

http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/gpucomputingsdk_3.2.16_linux.run

Put these files in /export/rocks/install/contrib/5.3/x86_64.

Next, modify line 10 in cudatoolkit_3.2.16_linux_64_rhel5.5.run to be

scriptargs="auto"

This changes the way the CUDA toolkit is installed so that you aren't
asked for anything during installation. This is necessary so that your
compute nodes can be installed without requiring user input. Anything
that requires user input at install time would defeat the whole purpose
of Rocks.
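If you'd rather script the change than hand-edit the file, a sed one-liner does it (a sketch; it assumes, as in the 3.2 installer, that line 10 is the scriptargs line):

```
cd /export/rocks/install/contrib/5.3/x86_64
# Force line 10 to scriptargs="auto" so the installer runs unattended
sed -i '10s/.*/scriptargs="auto"/' cudatoolkit_3.2.16_linux_64_rhel5.5.run
```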

The next step is to install the CUDA software on the frontend. Change
directory to /export/rocks/install/contrib/5.3/x86_64 and run the
following commands:

./devdriver_3.2_linux_64_260.19.21.run -s --no-kernel-module

./cudatoolkit_3.2.16_linux_64_rhel5.5.run

This software puts various libraries in /usr/local/cuda/lib64, so
you need to add this directory to the end of /etc/ld.so.conf and run

ldconfig

so that these libraries will be available to the CUDA software running
on the frontend.

If you choose to install the SDK, you should run

yum install libXi-devel freeglut freeglut-devel

so that you can build the example programs. Then, run

./gpucomputingsdk_3.2.16_linux.run

When asked for the install path, enter /usr/local/cuda. This puts the
SDK files under /usr/local/cuda/C/. Change to this directory and type

make

If you don't have any GPUs on the frontend you won't be able to run any
of these programs there, so don't be surprised when they fail. Copy the file
/usr/local/cuda/C/bin/linux/release/deviceQuery to
/export/rocks/install/contrib/5.3/x86_64. You've now set up your
frontend. Next comes modifying your Rocks distribution so that the
compute nodes run the CUDA software.

In /export/rocks/install/contrib/5.3/x86_64 create a file called
'all.sh' and put the following text in it:

./devdriver_3.2_linux_64_260.19.21.run -s

./cudatoolkit_3.2.16_linux_64_rhel5.5.run

echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf

ldconfig

cp deviceQuery /tmp

/tmp/deviceQuery

These commands will run on the compute nodes when Rocks is installed.
'deviceQuery' is run at the end to check that the software has been
installed correctly and that the GPU cards are usable.
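If you want all.sh to record an explicit pass/fail rather than relying on reading deviceQuery's output by hand, something like this could replace the last two lines of all.sh (untested; it assumes the 3.2 SDK deviceQuery prints "PASSED" on success):

```
cp deviceQuery /tmp
/tmp/deviceQuery > /tmp/deviceQuery.out 2>&1
if grep -q PASSED /tmp/deviceQuery.out; then
    echo "CUDA check passed on $(hostname)"
else
    echo "CUDA check FAILED on $(hostname)"
fi
```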

In /export/rocks/install/site-profiles/5.3/nodes/extend-compute.xml put
the following text in the <post> section:

cd /tmp

wget http://127.0.0.1/install/contrib/5.3/x86_64/all.sh

wget http://127.0.0.1/install/contrib/5.3/x86_64/cudatoolkit_3.2.16_linux_64_rhel5.5.run

wget http://127.0.0.1/install/contrib/5.3/x86_64/devdriver_3.2_linux_64_260.19.21.run

wget http://127.0.0.1/install/contrib/5.3/x86_64/deviceQuery

wget http://127.0.0.1/install/contrib/5.3/x86_64/gpucomputingsdk_3.2.16_linux.run

chmod 744 *

/tmp/all.sh > /tmp/all.out

These commands copy the various CUDA files to the compute nodes and then
run the 'all.sh' file you created above to actually install the CUDA
software. Finally, run the following commands

cd /export/rocks/install

rocks create distro

and then PXE-boot the compute nodes to install Rocks on them.
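For reference, the commands above have to sit inside the node-XML wrapper that Rocks expects in extend-compute.xml. A minimal skeleton (my sketch of the stock 5.3 layout; compare it against the skeleton file Rocks ships in site-profiles) looks like:

```
<?xml version="1.0" standalone="no"?>
<kickstart>
<description>
Install the Nvidia CUDA software on the compute nodes.
</description>
<post>
cd /tmp
wget http://127.0.0.1/install/contrib/5.3/x86_64/all.sh
<!-- ...the remaining wget, chmod, and all.sh lines from above... -->
</post>
</kickstart>
```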

Some contributors to the Rocks email list have said that it's necessary
to run

mknod -m 660 /dev/nvidia*

in order for the GPUs to be usable by non-root users. I haven't found
this to be necessary but I'm not sure why. Maybe by running the
'deviceQuery' command I'm somehow fixing this problem.
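Note that mknod needs a device type and major/minor numbers, so the command as quoted won't run verbatim. The usual boot-time fragment (adapted from the device-file script in Nvidia's driver README; 195 is Nvidia's character-device major number, and I'm assuming 2 GPUs per node as in this cluster) looks something like:

```
# Run as root, e.g. from rc.local, before any CUDA program starts
NGPUS=2
for i in $(seq 0 $((NGPUS - 1))); do
    mknod -m 660 /dev/nvidia$i c 195 $i
done
mknod -m 660 /dev/nvidiactl c 195 255
```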

I haven't attacked the question of the best way to configure SGE to
handle GPU cards. I'd welcome any suggestions.

Dave Kraus

Feb 3, 2011, 11:07:11 AM
to Discussion of Rocks Clusters
On 02/02/2011 05:08 PM, Jon Forrest wrote:
> ...

> I haven't attacked the question of the best way to configure SGE to
> handle GPU cards. I'd welcome any suggestions.

First of all, thank you for writing up the install procedure! I need to
upgrade our prototype cluster, which I used ClusterCorp's roll for
initially, and will probably use your methodology in the next incarnation.

SGE integration is the kicker, though. Our compute nodes each have 2 GPU
cards. The problem is, without programmatic intervention in the early
CUDA 2.x drivers, there was no way to explicitly tell an executing
program to use GPU 1 vs GPU 0.

Apparently, now, by using nvidia-smi to put the drivers into "exclusive"
mode (and running another daemon?), when a program requests a GPU context
it either gets a free GPU or the call fails. (Ref "CUDA C Best Practices
Guide" Version 3.2, section 8.3.)

So, given that, if the drivers are put into exclusive mode somehow (on
boot), and the necessary magic happens, then one should be able to set
up SGE with a GPU consumable resource for each compute node, equal to
the number of GPUs. Then, when a CUDA-enabled program is submitted, it
needs to request the number of GPUs it needs up to the number in the
node, and SGE then can manage when and where jobs get run with that
constraint.

At least, this is my current theory. I haven't had the opportunity to
actually set any of this up yet, but probably will need to in the next
couple weeks.

So the question is: has anybody else done this yet, and would you be
willing to share whether it works or where it fails?
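For what it's worth, the consumable-resource idea above would translate into roughly the following qconf steps (untested; "gpu" is a name I made up, and compute-0-0 stands in for each real node):

```
# 1. Add a consumable INT complex named "gpu": run "qconf -mc" and
#    append a line like
#      gpu  gpu  INT  <=  YES  YES  0  0
# 2. Tell SGE each compute node has 2 of them:
qconf -aattr exechost complex_values gpu=2 compute-0-0
# 3. Jobs then request GPUs at submit time:
#      qsub -l gpu=1 myjob.sh
```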

"Hung-Sheng Tsao (Lao Tsao 老曹) Ph. D."

Feb 3, 2011, 11:26:45 AM
to npaci-rocks...@sdsc.edu
check out the Bass cluster at UNC:
http://wwwx.cs.unc.edu/Research/bass/index.php/Bass_Wiki
It will give you some idea of SGE & GPU integration.
regards


On 2/3/2011 11:07 AM, Dave Kraus wrote:
> On 02/02/2011 05:08 PM, Jon Forrest wrote:
>> ...

>> I haven’t attacked the question of the best way to configure SGE to
>> handle GPU cards. I’d welcome any suggestions.


Dung N Do

Feb 3, 2011, 11:49:37 AM
to Discussion of Rocks Clusters
Here's how I set up SGE for the GPUs:

- Create a forced complex; call it 'cudaonly'
- Create a new queue, call it cuda.q, and assign it the forced complex
- Create a @cudahosts host group and assign it to cuda.q
- Give cuda.q 2 slots for each CUDA host
- Use nvidia-smi to put all GPUs into compute-exclusive mode

and finally,

- instruct users to submit their GPU jobs using qsub with -l cudaonly and -q
cuda.q
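The steps above translate into roughly these commands (my untested sketch; the nvidia-smi flags are the CUDA 3.2-era syntax, where -c 1 means compute-exclusive):

```
# Forced BOOL complex "cudaonly": run "qconf -mc" and append a line like
#   cudaonly  cudaonly  BOOL  ==  FORCED  NO  0  0
qconf -aq cuda.q          # create the queue; set hostlist @cudahosts, slots 2
qconf -ahgrp @cudahosts   # create the host group and add the GPU nodes
# Put each GPU into compute-exclusive mode:
nvidia-smi -g 0 -c 1
nvidia-smi -g 1 -c 1
# Users then submit with:
#   qsub -l cudaonly=1 -q cuda.q job.sh
```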

In my case, all of my GPU nodes have 8 CPU cores. I added the @cudahosts
group into all.q but gave those hosts only 6 slots. That way, non-GPU jobs
can run on the GPU nodes without using the GPUs.

Works pretty well for me that way.

Regards.

On Thu, Feb 3, 2011 at 10:07 AM, Dave Kraus <kr...@mtu.edu> wrote:

> On 02/02/2011 05:08 PM, Jon Forrest wrote:
>
>> ...
>>

>> I haven’t attacked the question of the best way to configure SGE to
>> handle GPU cards. I’d welcome any suggestions.

