Hi Victor,
I'm not an expert on CuPy, but I think it comes at the GPU acceleration problem from a different (and equally useful) angle than Numba. Numba is primarily designed to compile custom algorithms written in Python for targets like the CPU and GPU. Numba includes a "GPU device array" object, which is fairly barebones and exists simply because we couldn't write a compiler for GPU kernels without a place to put GPU data.
On the other hand, CuPy looks like a fairly complete reimplementation of the NumPy API on the GPU. This is great, because NumPy's API is familiar and flexible. Constructing custom kernels to fuse operations together is something CuPy also appears to support, though with less flexibility than Numba.
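To make the "reimplementation of the NumPy API" point concrete: because CuPy tracks NumPy so closely, a common community pattern is to write routines against a single namespace (often called `xp`) and point it at either library. This is a hedged sketch of that pattern, not code from either project; it runs as-is with NumPy, and the comment shows the one-line swap for the GPU.

```python
# Sketch of the NumPy/CuPy API-compatibility pattern. Swap the import for
# "import cupy as xp" and the same function runs on the GPU (assuming a
# working CUDA install), since CuPy mirrors these NumPy routines.
import numpy as xp

def normalize(a):
    """Scale an array to zero mean and unit variance."""
    return (a - xp.mean(a)) / xp.std(a)

a = xp.asarray([1.0, 2.0, 3.0, 4.0])
b = normalize(a)
print(xp.allclose(xp.mean(b), 0.0), xp.allclose(xp.std(b), 1.0))
```

The point isn't this particular function; it's that code written in NumPy terms often ports to the GPU with no algorithmic changes at all.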
It is interesting to note that there are now several reimplementations of the NumPy array on the GPU, each created incidentally to support some larger GPU computing project:
- PyTorch tensors
- CuPy ndarrays
- TensorFlow tensors
- Numba DeviceArrays
- PyCUDA DeviceAllocations
We are hugely in favor of an initiative where all of these implementations agree on a common way of sharing their GPU device pointer, memory layout, and element type, so that data from one package can be passed to any other. Numba has a private implementation of this concept, but we would happily abandon it for a shared standard. CuPy seems like an excellent Python-based GPU container, and we'd love to have Numba support reading and writing data in it.
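For a sense of how small such a standard could be, here is a hypothetical sketch of the metadata one library would need to expose for another to adopt its GPU data. Nothing here is an existing protocol, and every name is invented; it just packages the three pieces of information mentioned above (device pointer, memory layout, element type).

```python
# Hypothetical descriptor for sharing GPU arrays between libraries.
# These names are illustrative only -- no such standard exists yet.
from typing import Tuple

class GPUArrayDescription:
    """Minimal metadata needed to interpret another library's GPU buffer."""

    def __init__(self, device_pointer: int, shape: Tuple[int, ...],
                 strides: Tuple[int, ...], typestr: str):
        self.device_pointer = device_pointer  # raw CUDA device address
        self.shape = shape                    # dimensions, NumPy-style
        self.strides = strides                # memory layout, in bytes
        self.typestr = typestr                # element type, e.g. "<f4"

# A C-contiguous 4x4 array of little-endian float32 at some device address:
desc = GPUArrayDescription(0x7F00_0000, (4, 4), (16, 4), "<f4")
```

Given something like this, each package could construct its own array object around a foreign buffer without copying the data.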
That said, none of the above interop exists today, so I would make the following suggestion: CuPy sounds like a good choice for basic NumPy-like GPU computations. If you run into things that are hard to express in CuPy, Numba would be a great tool for solving that problem (and hopefully we'll have a way to share data with CuPy in the near future). If you want to write more complex GPU code where you have direct control over the operations of individual CUDA threads, Numba is the better starting point.