[WIP] Julia code on CUDA hardware


maleadt

Dec 19, 2014, 11:38:42 AM
to juli...@googlegroups.com
Hi all,    (warning: wall of text ahead)

During the last months, I have been working on and off on PTX support for the
Julia compiler and an accompanying package for interacting with the CUDA driver.
The current state allows for code such as:

using CUDA

@target ptx function kernel_vadd(a::CuDeviceArray{Float32}, b::CuDeviceArray{Float32},
                                 c::CuDeviceArray{Float32})
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    c[i] = a[i] + b[i]

    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)
cgctx = CuCodegenContext(ctx, dev)

dims = (3, 4)
a = round(rand(Float32, dims) * 100)
b = round(rand(Float32, dims) * 100)
c = Array(Float32, dims)

len = prod(dims)
@cuda (len, 1) kernel_vadd(CuIn(a), CuIn(b), CuOut(c))

@show a+b == c

destroy(cgctx)
destroy(ctx)

which is pretty neat I think :-)

However, due to an unfortunate turn of events I won't be able to spend much time
on this in the future. Consequently, I'm making this code public already so
anyone interested can look into it. I had hoped to polish it up a bit more
before exposing it to the masses.

A quick overview of the components:

1) Modified Julia compiler
https://github.com/maleadt/julia

Diff against upstream:
  • @target macro for adding a function-level "target" field to linfo
  • a separate set of LLVM objects for PTX codegen (module, pass manager, etc)
  • new exported functions for accessing the PTX code
  • some special cases spread throughout codegen to generate valid PTX code

Note that I tried to keep the codegen modifications to a minimum. Most changes
come from using an address-preserving bitcast, because PTX supports multiple
address spaces. Other changes can actually be reverted, see the TODO.md for more
information.

2) CUDA.jl runtime
https://github.com/maleadt/CUDA.jl


Based on JuliaGPU/CUDA.jl. Heavily modified (still incorporates original
functionality though):
  • native execution support
  • lightweight on-device arrays
  • improved API consistency
  • many new features

The most significant part is obviously the native execution support. The package
defines an @cuda macro, which compiles the called Julia function to PTX code and
interacts with the driver (creating and uploading a module, packing arguments,
launching the kernel). It also supports passing Julia arrays seamlessly,
mimicking Julia's pass-by-sharing convention and providing wrapper types to
indicate whether an object is input or output.
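
For comparison, this is roughly the driver-level sequence that @cuda automates.
The sketch below is written against the upstream JuliaGPU/CUDA.jl style of API
(CuModule, CuFunction, CuArray, launch, to_host, free, unload); treat the exact
names and signatures as approximate, and note that in my fork the PTX comes out
of the compiler rather than from a .ptx file:

# hedged sketch of the manual driver workflow that @cuda wraps
md   = CuModule("vadd.ptx")             # load a compiled PTX module
vadd = CuFunction(md, "kernel_vadd")    # look up the kernel entry point
ga   = CuArray(a)                       # allocate device memory and copy input
gb   = CuArray(b)
gc   = CuArray(Float32, dims)           # allocate storage for the output
launch(vadd, len, 1, (ga, gb, gc))      # len blocks of 1 thread, like @cuda (len, 1)
c    = to_host(gc)                      # copy the result back to a Julia array
free(ga); free(gb); free(gc)
unload(md)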

Most of the functionality in @cuda is built using staged functions, so the work
only happens once per kernel signature and the runtime overhead is minimal. This
means that in principle Julia+CUDA should be able to achieve the same performance
as NVCC-compiled .cu files :-)
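
To illustrate the staging mechanism itself (a toy example, unrelated to
CUDA.jl's internals): the body of a staged function sees the argument types,
runs once per concrete signature, and the expression it returns is what gets
compiled and executed on every subsequent call.

# toy staged function: the body sees types, not values, and runs once per signature
stagedfunction typename_of(x)
    println("generating code for ", x)   # printed only the first time per type
    return :( $(string(x)) )             # splice the type's name in as a constant
end

typename_of(1)     # prints the message, returns "Int64"
typename_of(2)     # generator does not run again, returns "Int64" directly
typename_of(1.0)   # new signature, generator runs once more, returns "Float64"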


There are plenty of limitations though:
  • cannot allocate in PTX code
  • can only pass bitstypes, arrays, or pointers
  • stdlib functionality is unavailable, because it lives in another module

In short: it is only usable for relatively simple kernels with straightforward
data interactions.
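
To give an idea of what does fit within those limits, here is an untested sketch
of another kernel that only reads and writes device arrays and takes a plain
bitstype scalar as argument (same intrinsics and wrapper types as the vadd
example above, with kernel_scale as a made-up name):

# untested sketch: elementwise scaling with a bitstype scalar argument
@target ptx function kernel_scale(a::CuDeviceArray{Float32}, b::CuDeviceArray{Float32},
                                  factor::Float32)
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    b[i] = a[i] * factor
    return nothing
end

@cuda (len, 1) kernel_scale(CuIn(a), CuOut(c), 2f0)   # reuses a and c from above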





Now, for the usage instructions. First, compile the modified Julia with PTX
codegen support:

$ git clone https://github.com/maleadt/julia.git
$ cd julia
$ make LLVM_VER=3.5.0


Optionally:
$ make LLVM_VER=3.5.0 testall

NOTE: the compiler requires `libdevice` to link kernel binaries. This library is
only part of recent CUDA toolkits (version 5.5 or greater). If you use an older
CUDA release (for example because you use the GPU Ocelot emulator, which only
supports up to CUDA 5.0) you _will_ need to get hold of these files.
Afterwards, you can point Julia to these files using the NVVMIR_LIBRARY_DIR
environment variable.


Second, install the CUDA.jl package, which wraps the CUDA driver API:

$ ./julia
  > Pkg.clone("https://github.com/maleadt/CUDA.jl.git")


Optionally:
$ ./julia
  > Pkg.test("CUDA")



Lastly, if you don't have any CUDA hardware, you can use the GPU Ocelot
emulator:

$ git clone --recursive https://github.com/maleadt/gpuocelot.git
$ cd gpuocelot
$ $EDITOR ocelot/scripts/build_environment.py   # edit nvcc invocation -- see README
$ CUDA_BIN_PATH=/opt/cuda-5.0/bin CUDA_LIB_PATH=/opt/cuda-5.0/lib CUDA_INC_PATH=/opt/cuda-5.0/include \
  python2 build.py --install -p $(realpath ../julia/usr)


NOTE: this might not compile flawlessly. You'll need at least the CUDA APIs, gcc
4.6, scons, LLVM 3.5 and Boost. Also, always --install into some directory, and
only use absolute paths because of build system restrictions. Have a look at the
README.

$ ./julia
  > using CUDA
  > CUDA_VENDOR
    "Ocelot"



I hope this stuff can prove useful to somebody :-)

I've updated each of the repositories (julia, CUDA.jl, gpuocelot) with README
and TODO files, but if there are any questions feel free to contact me.

Tim

Valentin Churavy

Dec 19, 2014, 12:01:26 PM
to juli...@googlegroups.com
Awesome work, I was planning to look into something similar for OpenCL/SPIR over the holidays. I think a good way forward might be to pull out independent changes from your julia fork and submit PRs for them, so that those changes can be properly discussed. Since I am also interested in this topic and would like to extend this in the direction of enabling the SPIR LLVM backend, I would like to offer my help in doing so.

But since I hardly know what I am doing, I would appreciate help from anybody else :)

Best Valentin

Tim Holy

Dec 19, 2014, 12:34:29 PM
to juli...@googlegroups.com
Really awesome. I am definitely looking at playing with this.

Now that we have two major expansions of CUDA (yours, and CUDArt,
https://github.com/JuliaGPU/CUDArt.jl), we should definitely look into what
needs to be done to merge their functionality.

--Tim

maleadt

Dec 19, 2014, 1:30:44 PM
to juli...@googlegroups.com
OpenCL through SPIR definitely seems like the superior option, but I opted for the NVIDIA stack because of its maturity. That said, I'm sure most of the engineering work here can be reused for similar back-ends.

As most of the programming is/was still pretty exploratory, I didn't want to bother upstream with potentially useless features, but yeah, if the current approach proves itself, I guess it should live in julia/master rather than in some isolated tree.
One feature I have submitted upstream is the llvmcall declaration support: https://github.com/JuliaLang/julia/pull/8740 -- but that hasn't been merged yet.
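
To give an idea of why that matters: the device intrinsics boil down to calls to NVVM intrinsics, which need an LLVM declaration before they can be called from generated IR. With declaration support, an intrinsic wrapper could look something along the lines of the sketch below (signatures approximate, exact syntax per the PR):

# approximate sketch of an intrinsic wrapper that needs a declaration,
# which is what llvmcall declaration support would enable
threadId_x_raw() = Base.llvmcall(
    ("declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()",        # declaration part
     """%id = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
        ret i32 %id"""),                                     # body part
    Int32, ())                                               # returns the 0-based thread index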

On Friday, December 19, 2014 at 18:01:26 UTC+1, Valentin Churavy wrote:

maleadt

Dec 19, 2014, 1:46:10 PM
to juli...@googlegroups.com
I'm afraid I didn't pay too much attention to keeping the changes mergeable. In principle my version of CUDA.jl should be mostly compatible with the one in JuliaGPU, and there haven't been any new developments there. No idea about CUDArt though; I haven't looked at that in detail.

On Friday, December 19, 2014 at 18:34:29 UTC+1, Tim wrote:

Tim Holy

Dec 19, 2014, 2:21:36 PM
to juli...@googlegroups.com
CUDA hasn't changed much at all, but CUDArt is quite different:

- I found I was getting segfaults with the driver API when I tried
using it in conjunction with cuFFT, so I switched to the runtime API
(later, I think @moonpence figured out a solution).

- The binding is generated by Clang, so it offers complete (low-level)
coverage of the runtime API.

- There are more methods to work conveniently with arrays (e.g., copying
sub-blocks to and from GPU memory)

- There are facilities to help with resource initialization & cleanup

- There are C++ kernels for a few utility functions like `fill!`


The README outlines further differences from CUDA.
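
Roughly, the array workflow those pieces add up to looks like the following sketch (written from memory, so check the README for the exact API):

using CUDArt

devices(dev -> capability(dev)[1] >= 2) do devlist
    device(devlist[1])
    d_a = CudaArray(rand(Float32, 100))   # allocate on the GPU and copy the data over
    fill!(d_a, 0f0)                       # one of the bundled utility kernels
    a = to_host(d_a)                      # copy back; the do-block handles cleanup
end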

That said, a big chunk of it is to make it easier to use kernels written in
C/C++. Thanks to you, that may not be so important anymore! I'm really excited
about that :-).

I don't have time now, but I'll definitely look through your code. I suspect
the best thing will be to make your version the official CUDA package, and merge
any useful bits of my work (if any) into it. Assuming, that is, that we can
use your package in conjunction with cuFFT.

Best,
--Tim

Valentin Churavy

Dec 19, 2014, 2:26:08 PM
to juli...@googlegroups.com
In the long run we will need to support both SPIR and CUDA, since NVIDIA has decided not to support SPIR.
I will go through your branch and see if I can extract features into separate pull requests, and then add the SPIR backend.

Valentin Churavy

Dec 20, 2014, 11:32:30 AM
to juli...@googlegroups.com
So the first change I isolated is the address-space-preserving bitcasts: https://github.com/JuliaLang/julia/pull/9423

Tim, looking at the compare between master and your work (https://github.com/maleadt/julia/compare/JuliaLang:master...master), I see that dump_llvm_module and dump_native_module are exported, but they are not included in that final diff. I assume you just forgot to remove them in https://github.com/maleadt/julia/commit/396a5bcd1850c5352c6d7d72e63cd28a8b945e11 ?

Best Valentin

Tim Besard

Dec 20, 2014, 11:59:28 AM
to juli...@googlegroups.com
Hi Valentin,

Yes exactly, these are invalid exports.

Tim

Viral Shah

Jan 1, 2015, 12:11:56 AM
to juli...@googlegroups.com
I agree that it would be great if this could be submitted as PRs to julia and CUDA.jl. That will serve as a starting point for others who are interested.

We also want to make GPU clusters easily accessible for Julia users on JuliaBox, which will make it easier for everyone to experiment with such code. We should get to this soon.

-viral

maleadt

Jan 15, 2015, 3:58:43 AM
to juli...@googlegroups.com
Small update: I have written a blog post with some more details about the code and how to use it. I'll also be presenting a poster about this work at the upcoming HiPEAC conference in Amsterdam.

Best,
Tim

Dahua

Apr 10, 2015, 9:52:29 AM
to juli...@googlegroups.com
What's the progress on this work?

It would definitely be welcome as a PR to CUDA.jl.

Dahua

Kevin Squire

Apr 10, 2015, 10:42:16 AM
to juli...@googlegroups.com
There was a pull request submitted to Julia mainline, but it hasn't gotten much traffic:

https://github.com/JuliaLang/julia/pull/9423

Valentin Churavy

Apr 10, 2015, 11:47:55 PM
to juli...@googlegroups.com
So the main barrier is getting the code mainlined. The PR mentioned by Kevin is necessary, although currently pointless on its own, because it depends on the capability of switching targets, for which I haven't had time to prepare a PR yet. The other thing this hinges on is either an extension of llvmcall like https://github.com/JuliaLang/julia/pull/8740 or the introduction of llvmdecl: https://github.com/JuliaLang/julia/issues/8308.

I hope to find some time in the next week to tackle at least one of these issues, but grad school is currently keeping me fairly busy.

Valentin Churavy

May 25, 2015, 10:51:44 AM
to juli...@googlegroups.com
So real life kept me relatively busy until now, but thanks to Simon bugging me I started to work on this again. My first step is to rebase the code onto the current Julia master. The work will happen over at https://github.com/JuliaGPU/julia and my current work is on the branch vc/cuda. I hope Tim is okay with me massively cutting the code history down to 4 commits. 

The current issue I am working on is adapting his code to the recent tuple changes, and from there I want to enable OpenCL/SPIR (realistically, by the time I get around to doing that we will probably/hopefully be in SPIR-V land).

-V

PS: PRs against that branch and code reviews are welcome.