Does ocelot support dynamic memory allocation in device?


Paweł Pieńkowski

Jan 20, 2013, 10:45:31 AM
to gpuo...@googlegroups.com
My algorithm (parallel multi-frontal Gaussian elimination) needs to allocate memory dynamically (for tree building) inside a CUDA kernel. Does Ocelot support this?

According to this: stackoverflow-link and the CUDA Programming Guide, I should be able to do this.

Errors:
1. When I call malloc() inside a kernel I get this error:
(2.000239) ExternalFunctionSet.cpp:371:  Assertion message: LLVM required to call external host functions from PTX.
solver: ocelot/ir/implementation/ExternalFunctionSet.cpp:371: void ir::ExternalFunctionSet::ExternalFunction::call(void*, const ir::PTXKernel::Prototype&): Assertion `false' failed.
2. When I try to get or set the malloc heap size (in host code):
solver: ocelot/cuda/implementation/CudaRuntimeInterface.cpp:811: virtual cudaError_t cuda::CudaRuntimeInterface::cudaDeviceGetLimit(size_t*, cudaLimit): Assertion `0 && "unimplemented"' failed.

Any advice?

Greg Diamos

Jan 21, 2013, 6:19:15 PM
to gpuo...@googlegroups.com

Hi Pawel,

Ocelot does support dynamic memory allocation on the device, but it needs LLVM installed in order to be able to make calls from device code into host code (for malloc and other functions).

Regarding the heap size, we leave that interface unimplemented since Ocelot doesn't actually have a limit on the heap size other than the limit on the host implementation of malloc.  I'll plan on removing that assertion so you can use code that sets the heap size with Ocelot, but it won't actually impose a limit on the heap size.
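For reference, on a real CUDA device the heap used by in-kernel malloc is sized with cudaDeviceSetLimit before the first kernel that allocates is launched; per the above, under Ocelot the call would simply be accepted without imposing a limit. A minimal host-side sketch (requires the CUDA toolkit to build):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Request an 8 MB device malloc heap. This must happen before
        // any kernel that calls malloc() has been launched.
        size_t requested = 8 * 1024 * 1024;
        cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Read the limit back; the driver may round the request up.
        size_t actual = 0;
        cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
        printf("malloc heap size: %zu bytes\n", actual);
        return 0;
    }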

Greg

--
You received this message because you are subscribed to the Google Groups "gpuocelot" group.
To view this discussion on the web visit https://groups.google.com/d/msg/gpuocelot/-/gsBjDpE7NcsJ.
To post to this group, send email to gpuo...@googlegroups.com.
To unsubscribe from this group, send email to gpuocelot+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/gpuocelot?hl=en.

Paweł Pieńkowski

Jan 22, 2013, 9:27:42 AM
to gpuo...@googlegroups.com
Hi Greg!

Thanks for the info. That's pretty good news for me, but I still have some problems with it. Could you help me or give me a piece of advice?

I installed Ocelot from the Debian package ocelot_2.1.1865_amd64.deb. My system had LLVM version 3.0-10.
As far as I can see in the code (/trunk/ocelot/ocelot/ir/implementation/ExternalFunctionSet.cpp), the HAVE_LLVM constant was not defined.
Next thing: my configure.ocelot had devices: [ "nvidia", "emulated", "llvm", "amd" ], so I changed it to devices: [ "llvm" ].
I compile my simple malloc test this way:
nvcc --cuda -arch=sm_21 malloc.cu
g++ -o malloc malloc.cu.cpp.ii `OcelotConfig -l`
And now I get this message:
Assertion message: LLVM support not compiled into ocelot.
and as far as I can see in the code, this is caused by the missing HAVE_LLVM constant.

So my question is: do I have to compile or run my program in a special way in order to run it with LLVM? Or do I have to compile Ocelot myself with LLVM support (and how do I do that)?

Paweł Pieńkowski

Jan 22, 2013, 11:35:35 AM
to gpuo...@googlegroups.com
I've compiled Ocelot from source from trunk. Now I have LLVM version 3.1 and Ocelot version 2.1.

I can compile and run my simple malloc test:
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>

__global__ void mallocTest() {
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
    __syncthreads();
}

int main() {
    int threads = 7;
    mallocTest<<<1, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}

But it seems to hang and never finishes, and I have no idea how to fix it or find the problem.

When I change the kernel so that it does not allocate memory:
__global__ void mallocTest() {
    char* ptr = NULL;
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    __syncthreads();
}
it works properly and prints out 7 pointers set to 0. So the problem may be that calling malloc hangs the program. But why?
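As an aside (not part of the fix discussed below), device-side malloc can legitimately return NULL when the heap is exhausted, so a defensive variant of the test kernel would check the pointer before using it. A sketch; mallocTestChecked is a hypothetical name for illustration:

    __global__ void mallocTestChecked() {
        size_t size = 123;
        char* ptr = (char*)malloc(size);
        if (ptr == NULL) {
            // Allocation failed (e.g. heap exhausted); report and bail out.
            printf("Thread %d: malloc failed\n", threadIdx.x);
            return;
        }
        memset(ptr, 0, size);
        printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
        free(ptr);
    }

Note that the __syncthreads() is dropped here: with an early return, a barrier after the check could be reached by only some threads of the block, which is undefined behavior.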

Greg Diamos

Jan 22, 2013, 2:59:34 PM
to gpuo...@googlegroups.com

That shouldn't deadlock, but I would guess that there is a bug in the cudaDeviceSynchronize call.  Can you try removing it and seeing if that fixes the deadlock?

Greg


Paweł Pieńkowski

Jan 22, 2013, 4:53:30 PM
to gpuo...@googlegroups.com
You are an Ocelot developer, aren't you?

I did that test (without cudaDeviceSynchronize()) and the result is the same: a hang. I suppose this is not a deadlock, because in most cases Ocelot catches deadlocks in my CUDA code.

And a funny thing: if I change devices (in configure.ocelot) to nvidia, the output is:
==Ocelot== WARNING - No CUDA devices found or all devices disabled!
==Ocelot==  Consider enabling the emulator in configure.ocelot.
Thread 0 got pointer: 0x7f832e79aef8
Thread 1 got pointer: 0x7f832e79aef8
Thread 2 got pointer: (nil)
Thread 3 got pointer: (nil)
Thread 4 got pointer: (nil)
Thread 5 got pointer: (nil)
Thread 6 got pointer: (nil)
I don't have an NVIDIA card at all. That's strange: the pointers should be different and not null.

I need this for my university graduate project. I'm writing a parallel FEM solver, and it contains a tree-building algorithm which needs to allocate device global memory dynamically. So I'm stuck at this point.

Best,
Paweł

Greg Diamos

Jan 22, 2013, 6:21:59 PM
to gpuo...@googlegroups.com
Yes, I'm an Ocelot developer.

I just wanted to reply back on the main thread with the solution. In order to synchronize with a running kernel using cudaDeviceSynchronize, Ocelot needs to run the kernel asynchronously; otherwise cudaDeviceSynchronize blocks access to the device (instead of the device launch queue), and malloc needs access to the device to allocate more memory.

Asynchronous execution wasn't on by default because it hasn't been completely stable in the past, but it has been passing all of the regression tests for the last few months, so I enabled it by default in r2168.

You can force it on in earlier versions by setting 'asynchronousKernelLaunch: True' in configure.ocelot.
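Putting the two settings from this thread together, a configure.ocelot fragment for older versions might look as follows. This is a sketch: the exact section layout can vary between Ocelot revisions, and the executive section name is assumed from the Ocelot source tree, so check your own configure.ocelot for where the devices list lives.

    executive: {
        devices: [ llvm ],
        asynchronousKernelLaunch: True
    }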

Greg


Paweł Pieńkowski

Jan 24, 2013, 1:12:06 PM
to gpuo...@googlegroups.com
Thanks a lot Greg!

I would never have figured this out by myself. ;)