Hi again all,
I've recently run into some confusing behavior with the device allocator and memory kinds.
For my problem, when I increase the number of GPUs, each GPU gets a smaller portion of the simulation to work on. For example, with an 8,000 by 8,000 problem on 2 GPUs, each GPU works on a 4,000 by 8,000 piece of the 2D array.
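To make that concrete, the per-rank dimensions are computed roughly like this (a simplified sketch; the struct and names are illustrative, but this is where the l_dims.num_points used below comes from):

#include <upcxx/upcxx.hpp>
#include <cstddef>

// Illustrative row-wise decomposition of the global 2D array across ranks.
struct LocalDims {
  std::size_t rows, cols;   // this rank's slab of the global array
  std::size_t num_points;   // rows * cols, used to size device buffers
};

LocalDims compute_local_dims(std::size_t g_rows, std::size_t g_cols) {
  std::size_t n = upcxx::rank_n(), r = upcxx::rank_me();
  LocalDims d;
  d.rows = g_rows / n + (r < g_rows % n ? 1 : 0);  // distribute leftover rows
  d.cols = g_cols;
  d.num_points = d.rows * d.cols;
  return d;
}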
I've confirmed this memory footprint by watching memory usage with nvidia-smi. However, if I scale the problem up enough, allocation fails when running on multiple devices, even though each device still has plenty of free memory.
At a certain point, one call to the device allocator's allocate method returns a null global pointer and things stop working. Here is how I'm allocating memory:
/*
h_allocate_buffer: uses a CUDA device allocator to allocate a buffer on the GPU
params:
  upcxx::device_allocator<upcxx::cuda_device> &gpu_alloc - the allocator
  size_t count - the number of entries in the buffer to be allocated
  std::string name - a label for the buffer, used in error messages
*/
template <class T>
upcxx::global_ptr<T, dev_mem> h_allocate_buffer(upcxx::device_allocator<upcxx::cuda_device> &gpu_alloc,
                                                size_t count, std::string name){
  upcxx::global_ptr<T, dev_mem> buff = gpu_alloc.allocate<T>(count);
  if(!buff){
    h_message("[ERROR]: rank ", upcxx::rank_me(), " failed to allocate a buffer of size ", count*sizeof(T), "! for: ", name, "\n");
    exit(1);
  }
  return buff;
}
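For what it's worth, the "plenty of memory on each device" observation is based on nvidia-smi; a per-rank check right before the suspect allocation could look like this (a sketch using the CUDA runtime's cudaMemGetInfo, error checking omitted):

#include <cuda_runtime.h>
#include <iostream>
#include <upcxx/upcxx.hpp>

// Print this rank's free/total device memory, e.g. just before a suspect
// allocation. Assumes the rank's CUDA device has already been selected.
void h_report_device_memory(const char *tag){
  size_t free_b = 0, total_b = 0;
  cudaMemGetInfo(&free_b, &total_b);
  std::cout << "rank " << upcxx::rank_me() << " [" << tag << "] free bytes: "
            << free_b << " / total bytes: " << total_b << "\n";
}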
An example of calling this function:
//create an allocator with the buff size
auto gpu_alloc = upcxx::device_allocator<upcxx::cuda_device>(gpu_device, buff_size);
//create pointers to specific regions of memory and allocate them
upcxx::global_ptr<GridPoint, dev_mem> gp_buff = h_allocate_buffer<GridPoint>(gpu_alloc, l_dims.num_points, "gp_buff");
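For completeness, gpu_device and buff_size are set up earlier, roughly along these lines (a simplified sketch; the real buff_size sums every buffer the rank will carve out of gpu_alloc, and the GPU mapping is illustrative):

//map co-located ranks onto the GPUs visible to this node
int gpus_on_node = 0;
cudaGetDeviceCount(&gpus_on_node);
int my_gpu = upcxx::local_team().rank_me() % gpus_on_node;
auto gpu_device = upcxx::cuda_device(my_gpu);

//segment size handed to the device allocator (plus the other buffers' bytes)
size_t buff_size = l_dims.num_points * sizeof(GridPoint);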
I've tried increasing both the UPC++ shared heap size and the GASNet max heap size environment variables, with no change. Any help tracking down what is causing this behavior would be greatly appreciated!
Regards,
Kirtus Leyba