UPC++ Memory Issue with GPU Device Allocator

10 views

Skip to first unread message

Kirtus Leyba

unread,

Jan 12, 2023, 6:00:25 PM1/12/23

to UPC++

Hi again all,

I've recently noticed something strange about using device allocator and memory kinds that is confusing me.

For my problem, when I increase the number of GPUs, each GPU gets a smaller portion of the simulation to work on. For instance, if my problem size is 8,000 by 8,000 and I use 2 GPUs they might work on 4,000 by 8,000 pieces of a 2D array, as a simple example.

This memory footprint is accurate when I test with nvidia-smi and watch memory usage. However, if I scale up the problem enough I crash with allocation on multiple devices even though there is plenty of memory on each device.

At a certain point one call to the device allocators allocate method returns null instead of a global pointer and things stop working. Here is how I'm allocating memory:

/*
h_allocate_buffer: uses a cuda allocator to allocate a buffer on the GPU
params:
const &upcxx::device_allocator<upcxx::cuda_device> gpu_alloc - the allocator
size_t count - the count of entries in the buffer to be allocated
*/
template <class T>
upcxx::global_ptr<T, dev_mem> h_allocate_buffer(upcxx::device_allocator<upcxx::cuda_device> &gpu_alloc,
size_t count, std::string name){
upcxx::global_ptr<T, dev_mem> buff = gpu_alloc.allocate<T>(count);
if(!buff){
h_message("[ERROR]: rank ", upcxx::rank_me(), " failed to allocate a buffer of size ", count*sizeof(T), "! for: ", name, "\n");
exit(1);
}
return buff;
}

An example of calling this function:

//create an allocator with the buff size
auto gpu_alloc = upcxx::device_allocator<upcxx::cuda_device>(gpu_device, buff_size);

//create pointers to specific regions of memory and allocate them
upcxx::global_ptr<GridPoint, dev_mem> gp_buff = h_allocate_buffer<GridPoint>(gpu_alloc, l_dims.num_points, "gp_buff");

I've tried increasing both the UPC++ shared heap size and the GASNET Max heap size environment variables with no change. If I could track down what is causing this behavior it would be greatly helpful!

Regards,
Kirtus Leyba

Dan Bonachea

unread,

Jan 12, 2023, 8:42:42 PM1/12/23

to Kirtus Leyba, UPC++

Hi Kirtus -

Thanks for the great question!

Each upcxx::device_allocator<upcxx::cuda_device> object manages a CUDA device memory segment of size determined by the sz_in_bytes constructor argument (named buff_size in your example). This is completely independent from the UPC++ shared heap size and GASNET_MAX_SEGSIZE which govern the host memory shared segment.

So most likely you need to increase the value of your buff_size variable to accommodate your data structures into the device segment. Note that upcxx::device_allocator<upcxx::cuda_device> defaults to aligning every allocation to a 256-byte boundary, rounding up the space consumed as needed. Also multi-page objects are rounded up to a page boundary. So the device segment size used to create the upcxx::device_allocator<upcxx::cuda_device> needs to be "a bit larger" than the sum of the data structures you'll allocate there (where the amount of "padding" you'll need depends on the allocation pattern).

As a side note, you might want to check out the convenience function upcxx::make_gpu_allocator() added in 2022.3.0 that streamlines device_allocator construction with less boilerplate code. More details here

Hope this helps..

-D

--
You received this message because you are subscribed to the Google Groups "UPC++" group.
To unsubscribe from this group and stop receiving emails from it, send an email to upcxx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/upcxx/0222c9f4-8994-48de-87d5-ef713f81eea2n%40googlegroups.com.

Reply all

Reply to author

Forward

0 new messages