UPC++ Memory Issue with GPU Device Allocator

Kirtus Leyba

Jan 12, 2023, 6:00:25 PM
to UPC++
Hi again all,

I've recently noticed something strange and confusing about using the device allocator and memory kinds.

For my problem, when I increase the number of GPUs, each GPU gets a smaller portion of the simulation to work on. For instance, if my problem size is 8,000 by 8,000 and I use 2 GPUs they might work on 4,000 by 8,000 pieces of a 2D array, as a simple example.

This memory footprint checks out when I watch memory usage with nvidia-smi. However, if I scale up the problem enough, allocation fails on multiple devices and the program crashes, even though there is plenty of free memory on each device.

At a certain point, a call to the device allocator's allocate method returns a null global pointer and things stop working. Here is how I'm allocating memory:

/*
h_allocate_buffer: uses a cuda allocator to allocate a buffer on the GPU
params:
    upcxx::device_allocator<upcxx::cuda_device> &gpu_alloc - the allocator
    size_t count - the number of entries in the buffer to be allocated
    std::string name - a label for the buffer, used in error messages
*/
template <class T>
upcxx::global_ptr<T, dev_mem> h_allocate_buffer(upcxx::device_allocator<upcxx::cuda_device> &gpu_alloc,
                                                size_t count, std::string name){
    upcxx::global_ptr<T, dev_mem> buff = gpu_alloc.allocate<T>(count);
    if(!buff){
        h_message("[ERROR]: rank ", upcxx::rank_me(), " failed to allocate a buffer of size ", count*sizeof(T), "! for: ", name, "\n");
        exit(1);
    }
    return buff;
}

An example of calling this function:

//create an allocator with the buff size
auto gpu_alloc = upcxx::device_allocator<upcxx::cuda_device>(gpu_device, buff_size);

//create pointers to specific regions of memory and allocate them
upcxx::global_ptr<GridPoint, dev_mem> gp_buff = h_allocate_buffer<GridPoint>(gpu_alloc, l_dims.num_points, "gp_buff");


I've tried increasing both the UPC++ shared heap size and the GASNet max segment size environment variables, with no change. Any help tracking down what is causing this behavior would be greatly appreciated!

Regards,
Kirtus Leyba

Dan Bonachea

Jan 12, 2023, 8:42:42 PM
to Kirtus Leyba, UPC++
Hi Kirtus - 

Thanks for the great question!

Each upcxx::device_allocator<upcxx::cuda_device> object manages a CUDA device memory segment whose size is determined by the sz_in_bytes constructor argument (named buff_size in your example). This is completely independent of the UPC++ shared heap size and GASNET_MAX_SEGSIZE, which govern the host-memory shared segment.

So most likely you need to increase the value of your buff_size variable so the device segment can accommodate your data structures. Note that upcxx::device_allocator<upcxx::cuda_device> defaults to aligning every allocation to a 256-byte boundary, rounding up the space consumed as needed, and multi-page objects are additionally rounded up to a page boundary. So the device segment size used to create the upcxx::device_allocator<upcxx::cuda_device> needs to be "a bit larger" than the sum of the sizes of the data structures you'll allocate there (where the amount of "padding" you'll need depends on the allocation pattern).
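For example, you might budget the segment size something like the sketch below, which assumes only the default 256-byte alignment described above (the helper name, the slack term, and the specific buffer sizes are illustrative, not part of the UPC++ API):

#include <cstddef>

// Round a byte count up to the allocator's default 256-byte alignment.
constexpr std::size_t pad_to_256(std::size_t bytes) {
    return (bytes + 255) / 256 * 256;
}

// Budget the device segment as the sum of the padded allocation sizes,
// plus some slack for page rounding of large objects, e.g.:
//   std::size_t buff_size = pad_to_256(l_dims.num_points * sizeof(GridPoint))
//                         + pad_to_256(other_count * sizeof(double))
//                         + slack_bytes;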

As a side note, you might want to check out the convenience function upcxx::make_gpu_allocator() added in 2022.3.0, which streamlines device_allocator construction with less boilerplate code. More details are available in the UPC++ documentation.
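A minimal sketch of that approach, assuming UPC++ 2022.3.0 or later and reusing the buff_size variable from your example:

#include <upcxx/upcxx.hpp>

// After upcxx::init(): one call opens a CUDA device (selected automatically
// by default) and creates a device segment of buff_size bytes, returning a
// upcxx::device_allocator<upcxx::cuda_device>:
auto gpu_alloc = upcxx::make_gpu_allocator<upcxx::cuda_device>(buff_size);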

Hope this helps.
-D

