So the more I look at this, the less appealing it seems. At some point you end up having to rewrite Julia's codegen code and the LLVM C++ convenience classes in Julia, and then you have to track them on a parallel development path. Not to mention what would have to be done when Julia statically compiles...
Maybe what I am looking for is a modularization of Julia's C++ JIT codegen code. The GPU backend (and "dangerous" CPU code with bounds checks hoisted out of the for loops, etc...) would then have direct access to parts of the stock Julia codegen (or to the same LLVM library) and to memory management.
The codegen modules could register their memory-allocation requirements, and then all allocations could meet their combined criteria (alignment, etc...). They would also be able to tag objects in memory; one use of this would be to lock arrays while they are being processed on the GPU. I think the intermediate GPU code could be kept in resources that the CPU JIT compiler would deal with, so it could be exported to libraries.