The problem with GPUs is that you can't work with them seamlessly. A GPU is a separate device, and transferring data to and from it takes time, often much more than the computation itself. Moreover, GPUs are optimized for high throughput, so using one on relatively small arrays (say, a 1024x1024 matrix) will in most cases be slower than doing the same task on the CPU.
So normally we use the GPU only for large computations and try to minimize transfers between host and device, keeping data on the device whenever possible. I therefore strongly suggest doing some performance testing before proceeding with the approach you started in
OpenCLBLAS.jl. Also note that we already have a similar project,
JuliaGPU/CLBLAS.jl, so you may be interested in contributing to it.
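
To make the trade-off above a bit more concrete, here is a rough back-of-the-envelope sketch in plain Julia (no GPU packages involved). It models a bandwidth-bound elementwise operation on two n×n `Float32` arrays. The function names and all bandwidth figures are my own illustrative assumptions, not measurements of any real hardware, so treat it only as a way to reason about the problem before you benchmark.

```julia
# Rough cost model for an elementwise operation C = A .+ B on n×n Float32 arrays.
# Every bandwidth figure below is a hypothetical placeholder, NOT a measurement.

# Bytes touched by the operation: read A and B, write C.
bytes_touched(n) = 3 * n^2 * sizeof(Float32)

# Time on the CPU, assuming the operation is limited by host memory bandwidth.
cpu_time(n; cpu_mem_bw = 20e9) = bytes_touched(n) / cpu_mem_bw

# Time on the GPU. `resident = true` models data that already lives on the
# device, so no host<->device copies are paid.
function gpu_time(n; resident = false, gpu_mem_bw = 200e9, bus_bw = 8e9)
    kernel = bytes_touched(n) / gpu_mem_bw
    copies = resident ? 0.0 : bytes_touched(n) / bus_bw
    return kernel + copies
end

for n in (256, 1024, 4096)
    println("n = ", n,
            "  cpu = ", cpu_time(n), " s",
            "  gpu (with copies) = ", gpu_time(n), " s",
            "  gpu (resident) = ", gpu_time(n; resident = true), " s")
end
```

With these assumed numbers the copies dominate the GPU time, and the GPU only comes out ahead when the data is already resident on the device. Real figures depend heavily on the hardware and on the operation (compute-heavy routines such as matrix multiplication behave differently), which is why actual benchmarks are what matter here.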