Hi all, (warning: wall of text ahead)
Over the past months, I have been working on and off on PTX support for the
Julia compiler, along with an accompanying package for interacting with the
CUDA driver.
The current state allows for code such as:
using CUDA

@target ptx function kernel_vadd(a::CuDeviceArray{Float32}, b::CuDeviceArray{Float32},
                                 c::CuDeviceArray{Float32})
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    c[i] = a[i] + b[i]
    return nothing
end

# set up the device and a codegen context
dev = CuDevice(0)
ctx = CuContext(dev)
cgctx = CuCodegenContext(ctx, dev)

# generate some input data
dims = (3, 4)
a = round(rand(Float32, dims) * 100)
b = round(rand(Float32, dims) * 100)
c = Array(Float32, dims)

# launch the kernel and verify the result
len = prod(dims)
@cuda (len, 1) kernel_vadd(CuIn(a), CuIn(b), CuOut(c))
@show a+b == c

destroy(cgctx)
destroy(ctx)
which is pretty neat I think :-)
However, due to an unfortunate turn of events I won't be able to spend much time
on this in the future. Consequently, I'm making this code public already so
anyone interested can look into it. I had hoped to polish it up a bit more
before exposing it to the masses.
A quick overview of the components:
1) Modified Julia compiler
https://github.com/maleadt/julia

Diff against upstream:
- @target macro for adding a function-level "target" field to linfo
- a separate set of LLVM objects for PTX codegen (module, pass manager, etc.)
- new exported functions for accessing the generated PTX code
- assorted special cases throughout codegen to emit valid PTX code
Note that I tried to keep the codegen modifications to a minimum. Most changes
stem from using an address-preserving bitcast, because PTX supports multiple
address spaces. Other changes can actually be reverted; see TODO.md for more
information.
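For example, dumping the generated PTX assembly for a function might look
roughly like this (the code_ptx name is hypothetical; check the repository's
README for the actual exports):

# Hypothetical sketch: compile kernel_vadd for the PTX target and return the
# generated assembly. The actual exported function name may differ.
code_ptx(kernel_vadd, (CuDeviceArray{Float32}, CuDeviceArray{Float32},
                       CuDeviceArray{Float32}))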
2) CUDA.jl runtime
https://github.com/maleadt/CUDA.jl

Based on JuliaGPU/CUDA.jl. Heavily modified (but still incorporating the
original functionality):
- native execution support
- lightweight on-device arrays
- improved API consistency
- many new features
The most significant part is obviously the native execution support. The
package defines an @cuda macro, which compiles the called Julia function to
PTX code and interacts with the driver (creating and uploading a module,
packing arguments, launching the kernel). It also supports seamless passing of
Julia arrays, mimicking Julia's pass-by-sharing convention, and provides
wrapper types to indicate whether each object is input and/or output.
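To give an idea of what the macro automates, a single @cuda call boils down to
roughly the following driver interactions, sketched here with the driver-API
wrappers inherited from JuliaGPU/CUDA.jl (names and signatures may differ from
what the package actually does internally):

# Rough sketch of the work @cuda performs, using pre-existing driver wrappers;
# illustrative only, not the macro's actual implementation.
md = CuModule(ptx)                    # upload the compiled PTX module
fn = CuFunction(md, "kernel_vadd")    # look up the kernel entry point
ga = CuArray(a)                       # copy the inputs to device memory
gb = CuArray(b)
gc = CuArray(Float32, dims)           # allocate the output
launch(fn, len, 1, (ga, gb, gc))      # launch on a len x 1 configuration
c = to_host(gc)                       # copy the result back to the host
free(ga); free(gb); free(gc)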
Most of the functionality behind @cuda is built using staged functions,
meaning the expensive work only takes place once per type signature and the
runtime overhead is minimal. This means that, in principle, Julia+CUDA should
be able to achieve the same performance as NVCC-compiled .cu files :-)
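For readers unfamiliar with staged functions (the stagedfunction construct in
0.4-dev): the body runs once per combination of argument types, at compile
time, and the expression it returns is what gets compiled and cached for all
later calls. A minimal sketch, unrelated to CUDA.jl's actual internals:

# Illustration only: inside a staged function, `x` is bound to the argument's
# *type*; the returned expression becomes the function body for that type.
stagedfunction packed_size(x)
    sz = sizeof(x)    # computed once, at compile time
    return :($sz)     # callers just get a precomputed constant
end

packed_size(1.0)      # returns 8; later Float64 calls reuse the cached code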
There are plenty of limitations though:
- no dynamic allocation in PTX code
- only bits types, arrays and pointers can be passed to kernels
- stdlib functionality is unavailable, because it lives in another module
In short: it is only usable for relatively simple kernels with non-complex
data interactions.
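For instance, a kernel like the following will not compile, as it tries to
allocate an array on the device (my own example, not from the test suite):

# This kernel will NOT compile: Array() needs the Julia runtime allocator,
# which is unavailable from PTX code.
@target ptx function kernel_bad(a::CuDeviceArray{Float32})
    tmp = Array(Float32, 10)    # dynamic allocation -> unsupported
    return nothing
end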
Now, for the usage instructions. First, compile the modified Julia with PTX
codegen support:
$ git clone https://github.com/maleadt/julia.git
$ cd julia
$ make LLVM_VER=3.5.0

Optionally:

$ make LLVM_VER=3.5.0 testall

NOTE: the compiler requires `libdevice` to link kernel binaries. This library
is only part of recent CUDA toolkits (version 5.5 or greater). If you use an
older CUDA release (for example because you use the GPU Ocelot emulator, which
only supports up to CUDA 5.0) you _will_ need to get hold of these files.
Afterwards, you can point Julia to them using the NVVMIR_LIBRARY_DIR
environment variable.
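For example, when the toolkit is installed under /opt/cuda-5.5 (adjust the
path to your installation):

$ NVVMIR_LIBRARY_DIR=/opt/cuda-5.5/nvvm/libdevice ./julia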
Second, install the CUDA.jl package (which wraps the CUDA driver API):
$ ./julia
> Pkg.clone("https://github.com/maleadt/CUDA.jl.git")

Optionally:

$ ./julia
> Pkg.test("CUDA")

Lastly, if you don't have any CUDA hardware, you can use the GPU Ocelot
emulator:
$ git clone --recursive https://github.com/maleadt/gpuocelot.git
$ cd gpuocelot
$ $EDITOR ocelot/scripts/build_environment.py # edit nvcc invocation -- see README
$ CUDA_BIN_PATH=/opt/cuda-5.0/bin CUDA_LIB_PATH=/opt/cuda-5.0/lib CUDA_INC_PATH=/opt/cuda-5.0/include \
  python2 build.py --install -p $(realpath ../julia/usr)

NOTE: this might not compile flawlessly. You'll need at least the CUDA APIs,
gcc 4.6, scons, LLVM 3.5 and Boost. Also, always --install into some
directory, and only use absolute paths, because of build-system restrictions.
Have a look at the README.
$ ./julia
> using CUDA
> CUDA_VENDOR
"Ocelot"I hope this stuff can prove useful to somebody :-)
I've updated each of the repositories (julia, CUDA.jl, gpuocelot) with README
and TODO files, but if there are any questions feel free to contact me.
Tim