CUDA code takes forever to compile and consumes all system memory

Kai Song

Jan 4, 2011, 1:48:50 PM
to vscse-many-core...@googlegroups.com
Hi,

I am not sure if this mailing list is still active, but I have been stuck on a problem for a long time, and maybe some of you have run into it and can give me a clue.

My application has a big data structure: 63,692 bytes. My intended implementation is that each thread has its own copy of this data structure, with different parameters, in global memory. Each thread then runs embarrassingly parallel and saves its individual result in the data structure it is associated with. When all threads finish, the data structures in global memory are copied back to the host, and the host aggregates the results.
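
Schematically, the setup looks like this (a much-simplified sketch: the real Ring structure has many more fields, and the field names and sizes below are just placeholders):

// Much-simplified sketch of the intended layout: one Ring per thread,
// all of them in global memory. Field names/sizes are placeholders;
// the real structure is 63692 bytes.
struct Ring {
    float params[16];    // per-thread input parameters
    float result[16];    // per-thread results
    float scratch[64];   // pre-allocated scratch space
};

__global__ void test_kernel(Ring *rings, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    Ring *r = &rings[tid];           // each thread works on its own copy
    // ... the functions all operate on *r ...
    out[tid] = r->result[0];
}

int main()
{
    const int nThreads = 256;
    Ring  *d_rings;
    float *d_out;
    cudaMalloc(&d_rings, nThreads * sizeof(Ring));
    cudaMalloc(&d_out,   nThreads * sizeof(float));
    // ... copy parameters in, launch, copy every Ring back, aggregate ...
    test_kernel<<<1, nThreads>>>(d_rings, d_out);
    cudaFree(d_rings);
    cudaFree(d_out);
    return 0;
}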

There are hundreds of functions written to work with the data structure. I have tested their functionality by running everything on the CPU. However, when I put these functions in a kernel and invoke it with one thread, the code takes forever to compile and gradually consumes all host memory. I have no clue what the cause is, and I get the following warning message:
==========================================================
/tmp/tmpxft_00002d59_00000000-7_TracyGPmain.cpp3.i(0): Warning: Olimit was exceeded on function _Z11test_kernelP4RingPf; will not perform function-scope optimization.
To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=222003
==========================================================


I have looked up this warning, and it seems to have something to do with the exhaustion of registers, but I have minimized the use of local variables in each function by pre-allocating scratch space in the data structure. I am really out of things to try now, because I can't even compile the code. Am I out of luck, with nothing I can do? Is there something wrong with my initial approach? If not, is there anything else I can try to make the code compile?

Many thanks in advance! And please let me know if you need more information about the code.

Kai

John Stratton

Jan 4, 2011, 2:08:42 PM
to vscse-many-core...@googlegroups.com
Interesting situation. I haven't seen many applications that try to work with a data structure that large for every element, so yes, you'll probably run into some less common issues.

My first comment is that registers are used for more than local variables; in fact, some local variables may not take up an "extra" register at all. Placing values in registers, and thus determining how many registers a function needs, is done by the compiler: it tracks every value you load, store, or compute through any variable or memory location, and puts the values your code uses into registers when it needs them.

So even though you allocated scratch space in the data structure to avoid using registers, the compiler still has the option of keeping those values in registers, especially if it knows your code will use them later.

To be more explicit to the compiler, you should think about putting your scratch space in local variables with the __local__ keyword. This is a more direct way of telling the compiler, "I want every thread to have space for these values, and I don't want that space to come from the registers." Since the compiler can't figure out that this is what you meant from the code you have, it is essentially undoing your scheme in an attempt to increase performance. Using the local keyword should help the compiler understand what you actually meant to do.
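
Something along these lines, for example (a rough sketch, reusing the placeholder Ring from your message; whether the array actually stays in registers or spills to per-thread local memory is ultimately the compiler's decision):

// Sketch: per-thread scratch as an automatic array instead of a field
// in the global-memory structure. Every thread gets its own copy, and
// the compiler chooses registers or per-thread local memory for it.
__device__ void do_work(Ring *r)
{
    float scratch[25];               // per-thread scratch space
    for (int i = 0; i < 25; ++i)
        scratch[i] = 0.0f;
    // ... compute into scratch ...
    r->result[0] = scratch[0];       // write only final results to global memory
}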

The other question I had is: are you putting all those hundreds of functions into a single kernel, or are you declaring a different kernel for each? The size of a kernel can sometimes cause problems of its own, for both the compiler and the hardware, in terms of resource usage.

--John

Kai Song

Jan 4, 2011, 2:53:14 PM
to vscse-many-core...@googlegroups.com
Hi John,

Thanks for the reply. Great information! Declaring "__local__" will probably help the compiler a little, but as for the potential issue with a large kernel that you pointed out: I did indeed put all the functions into one single kernel. The application involves many 5x5 matrix/vector operations, so a large number of the functions are matrix/vector functions. Given the small size of the matrices and vectors, I didn't use a tuned matrix library. There is also another big set of functions built on top of these math functions. Essentially, this large kernel may be the main problem.
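
A typical one of these math functions looks roughly like this (a representative sketch only; the actual names and signatures in lib.cu differ):

// Representative sketch of one of the small math functions: y = M * x
// for a 5x5 matrix. Names and signatures are placeholders.
__device__ void matvec5(const float M[5][5], const float x[5], float y[5])
{
    for (int i = 0; i < 5; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < 5; ++j)
            acc += M[i][j] * x[j];
        y[i] = acc;
    }
}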

I put all these functions in lib.cu and invoke the ones I need in kernel.cu. I am not sure if this is the best way to do it. When you say "declaring a different kernel for each," do you mean breaking the large kernel up into smaller files? This is my first time porting an actual application to CUDA, so I have no idea how to structure a CUDA kernel with many functions in it. At this point I am quite concerned about whether this embarrassingly parallel application can be ported to the GPU at all. Any advice?

Again, thanks a lot for your help!

Kai

John Stratton

Jan 4, 2011, 3:44:45 PM
to vscse-many-core...@googlegroups.com
Ah, yes, that makes sense. 

When I say "declare a different kernel for each," what I mean is that your application might just be too big to accelerate in a single kernel launch. Since you already have scratch space in global memory, it might be better to break the work up into many kernel launches. The reason most people don't do this is that between kernel launches you have to store any computed results and reload all your data in the next kernel, which is very expensive. However, since it looks like you were doing that anyway because of register limitations, it might not be such a bad idea.
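
Schematically, something like this (a sketch only: the two-stage split and the names are made up, and the intermediate state lives in your global-memory Ring structures between launches):

// Sketch: one huge kernel split into sequential launches. Stage names
// are placeholders; each stage runs a subset of the device functions
// and leaves its results in the Ring structures for the next stage.
__global__ void stage1_kernel(Ring *rings) { /* first group of functions  */ }
__global__ void stage2_kernel(Ring *rings) { /* second group of functions */ }

void run_all(Ring *d_rings, int nThreads)
{
    stage1_kernel<<<1, nThreads>>>(d_rings);  // results stay in d_rings
    stage2_kernel<<<1, nThreads>>>(d_rings);  // continues from stage1's state
    cudaDeviceSynchronize();                  // wait before copying back
}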

Before going all the way down that route, though, check whether some of your functions are called from multiple places. If so, the compiler by default is "inlining" them: making a copy of all that code and pasting it in every place the function is called. That can significantly, and possibly unnecessarily, increase the size of your code, and may be causing some of the slow compilation and memory explosion you're seeing. If you find functions used like that, try adding the "__noinline__" keyword to their declarations to tell the compiler to keep only one copy of the code and treat it like a real function. Because of hardware limitations, this won't always be possible, but from what you described I think it should work for you. Not sure how far it will actually get you, though.
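
For example, on one of your matrix functions (a sketch; __noinline__ is a request rather than a guarantee, and on some hardware the compiler may have to inline anyway):

// Sketch: adding __noinline__ to a frequently-called device function
// asks the compiler to keep a single copy and emit a real call,
// instead of pasting the body at every call site. (Body unchanged.)
__device__ __noinline__ void matvec5(const float M[5][5],
                                     const float x[5], float y[5]);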

--John