CUDPP code is very large

garyo

Dec 10, 2009, 1:32:58 PM
to CUDPP
I need to do a parallel scan in my CUDA app, so I'm looking at CUDPP.
The API looks very nice, but I see that the built (final-mode) library
is over 30MB, and just scan_app.o by itself is over 7MB! (on Linux/64
and Windows/32). This seems to be due to the compiled-in CUDA code in
__deviceText_$sm_10$ and __deviceText_$compute_10$, which are 2.2MB
and 4.4MB respectively (on Linux). Is this what others get?

If I included this in my product it would nearly double the download
size; I don't think I can afford to do it. Is there any way to cut it
down? I only need a few types of scans that I know in advance, if
that helps.

John Owens

Dec 10, 2009, 1:37:04 PM
to cu...@googlegroups.com
libcudpp.a on OS X is 37.6 MB so I'm afraid you are probably looking at an accurate size.

This is not something we've even discussed. I don't know why it's so big, personally, though Mark might have some thoughts. It would seem like having the CUDA code would be useful :) but I don't have an answer for you; I'd be happy to hear from anyone on the list who has any experience with this issue.

JDO


Mark Harris

Dec 10, 2009, 5:43:50 PM
to cu...@googlegroups.com
Hi Gary,

This is a known issue that we have discussed. :) 

This is an unfortunate side effect of the heavy template optimization done in CUDPP to achieve high performance. Generating template functions for each possible combination of options makes for optimal code, but a lot of it. I've added a comment to issue 18 explaining how you can save a lot of space and compile time by commenting out the dispatches of template versions you don't need. I'll repeat that comment here to help you.

Gary, we really do want people to use CUDPP in real products, so feedback like this is great.  Thanks -- please let us know if the following is a sufficient fix for you.  In the future I'd like to find a way to reduce file size, or at least to allow users to configure at compile time what options are included.  Suggestions welcome.

Just a comment to help users who have issues with this problem.  If you need to 
reduce the CUDPP library binary size, you can comment out generation of the template 
kernels you don't need.  In any of the *_app.cu files there is a Dispatch function 
for the corresponding algorithm.  To optimize performance we have to use a large 
switch/if-else to dispatch at run time the appropriate compile-time optimized 
template kernel function.  To reduce compiled object size and also compile time, you 
can simply comment out the switch options that you don't need.
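
To make the dispatch pattern described above concrete, here is a minimal host-side C++ sketch. It is not CUDPP's actual code: the names Op, ScanConfig, scanKernel and scanDispatch are hypothetical, and a plain sequential loop stands in for the real CUDA kernels. It only illustrates how each run-time branch forces one template instantiation, and how commenting a branch out drops that instantiation from the object file.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical option type for illustration; not CUDPP's real enum.
enum class Op { Add, Max };

// Identity element and combine step, selected at compile time by `op`.
template <typename T, Op op>
T identity() {
    return op == Op::Add ? T(0) : std::numeric_limits<T>::lowest();
}

template <typename T, Op op>
T combine(T a, T b) {
    return op == Op::Add ? a + b : (a > b ? a : b);
}

// Stand-in for a templated kernel. The compiler emits one copy of this body
// for every distinct <T, op, backward, exclusive> combination that is used.
template <typename T, Op op, bool backward, bool exclusive>
void scanKernel(const T* in, T* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    T running = identity<T, op>();
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = backward ? n - 1 - k : k;
        const T next = combine<T, op>(running, in[i]);
        out[i] = exclusive ? running : next;
        running = next;
    }
}

// Run-time options, as a caller might fill them in (hypothetical struct).
struct ScanConfig {
    Op op;
    bool backward;
    bool exclusive;
};

// Maps run-time options onto compile-time template arguments.
// Returns false if the requested variant was not compiled in.
template <typename T>
bool scanDispatch(const ScanConfig& c, const T* in, T* out, std::size_t n) {
    if (c.op == Op::Add && !c.backward && c.exclusive) {
        scanKernel<T, Op::Add, false, true>(in, out, n);
        return true;
    }
    // Every additional branch instantiates one more kernel variant.
    // Leaving a branch commented out keeps that variant out of the binary.
    // if (c.op == Op::Add && !c.backward && !c.exclusive) {
    //     scanKernel<T, Op::Add, false, false>(in, out, n);
    //     return true;
    // }
    // ... remaining op / direction / inclusivity combinations ...
    return false;
}
```

In this sketch a caller that asks for a variant that was commented out gets false back instead of a result, so it is worth checking the return value (or failing loudly) after trimming branches.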

For example, if you don't need segmented scan, comment out everything inside 
cudppSegmentedScanDispatch(): 
http://code.google.com/p/cudpp/source/browse/tags/1.1/cudpp/src/app/segmented_scan_app.cu#386

Then, if you only need forward exclusive integer +-scans, comment out everything but 
the lines that invoke that type of scan:

http://code.google.com/p/cudpp/source/browse/tags/1.1/cudpp/src/app/scan_app.cu#446

Your compile time and file size will be greatly reduced.
Mark

garyo

Dec 13, 2009, 4:09:47 PM
to CUDPP
Hi, Mark -- thanks, that's just the pointer I needed.

(btw, I used to work at Thinking Machines, so scans are old
friends to me. Can't imagine life without them.)

I fixed it in a hacky way by just #ifdefing out the things I don't want
in the *_app.cu Dispatch functions, and ended up with the whole lib at
~1.1 MB with one scan: forward float add (which is around 300 KB itself).
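
A rough host-side C++ sketch of that #ifdef approach follows. It is an illustration only, not the actual CUDPP source: the macro names ENABLE_FWD_ADD_SCAN and ENABLE_BWD_ADD_SCAN and the functions inclusiveAddScan and addScanDispatch are made up for this example, and a sequential loop stands in for the CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical build switches (not part of CUDPP). Define only the variants
// you ship, either here or with -D on the compiler command line.
#define ENABLE_FWD_ADD_SCAN
// #define ENABLE_BWD_ADD_SCAN

// Stand-in for a templated kernel: one compiled copy per <T, backward> used.
template <typename T, bool backward>
void inclusiveAddScan(const T* in, T* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    T sum = T(0);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = backward ? n - 1 - k : k;
        sum += in[i];
        out[i] = sum;
    }
}

// Returns false when the requested variant was compiled out.
bool addScanDispatch(bool backward, const float* in, float* out, std::size_t n) {
#ifdef ENABLE_FWD_ADD_SCAN
    if (!backward) {
        inclusiveAddScan<float, false>(in, out, n);
        return true;
    }
#endif
#ifdef ENABLE_BWD_ADD_SCAN
    if (backward) {
        inclusiveAddScan<float, true>(in, out, n);
        return true;
    }
#endif
    // Silence unused-parameter warnings if every variant is disabled.
    (void)backward; (void)in; (void)out; (void)n;
    return false;
}
```

Compared with commenting code out, guards like these keep the disabled branches visible in the source, so re-enabling a variant is a one-line change.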

There's still 300 KB of radixsort code I don't need even after I ifdefed
out a lot of stuff -- that could probably just be all removed from the
compile process, but maybe it doesn't get linked in if not needed so I
don't think I particularly care.

My suggestion would be a compile-time config file that auto-generates
the dispatch functions based on what you need; maybe via a python
script that just spits out the dispatch functions into a .c file
that gets #included into the app file.
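
One way to get a similar effect without an external script is a preprocessor "X-macro" list that plays the role of the config file. This swaps the Python generator for a different technique, and everything in the sketch below is hypothetical (SCAN_VARIANTS, TypeTag, ScanConfig, addScan, addScanDispatch); it is plain host-side C++ with a loop standing in for the CUDA kernels, not CUDPP code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical run-time type tag, loosely in the spirit of a datatype enum.
enum class Type { Int, Float };

// Maps a C++ type to its run-time tag.
template <typename T> struct TypeTag;
template <> struct TypeTag<int>   { static constexpr Type value = Type::Int; };
template <> struct TypeTag<float> { static constexpr Type value = Type::Float; };

// Stand-in for a templated kernel: add-scan, forward or backward,
// inclusive or exclusive, chosen at compile time.
template <typename T, bool backward, bool exclusive>
void addScan(const T* in, T* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    T sum = T(0);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = backward ? n - 1 - k : k;
        const T next = sum + in[i];
        out[i] = exclusive ? sum : next;
        sum = next;
    }
}

// The "config file": one line per variant to build, as (type, backward,
// exclusive). In a real tree this list could live in its own header.
#define SCAN_VARIANTS(X)      \
    X(float, false, false)    \
    X(int,   false, true)

struct ScanConfig {
    Type type;
    bool backward;
    bool exclusive;
};

// The dispatch branches are generated from SCAN_VARIANTS, so only the listed
// variants are instantiated. Returns false for anything not in the list.
bool addScanDispatch(const ScanConfig& c, const void* in, void* out,
                     std::size_t n) {
#define DISPATCH_CASE(T, BACKWARD, EXCLUSIVE)                              \
    if (c.type == TypeTag<T>::value && c.backward == (BACKWARD) &&         \
        c.exclusive == (EXCLUSIVE)) {                                      \
        addScan<T, BACKWARD, EXCLUSIVE>(static_cast<const T*>(in),         \
                                        static_cast<T*>(out), n);          \
        return true;                                                       \
    }
    SCAN_VARIANTS(DISPATCH_CASE)
#undef DISPATCH_CASE
    return false;
}
```

Adding or removing a variant is then a one-line edit to the list. A script-generated file would allow richer configuration, at the cost of an extra build step.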

(There may also be a way to not template *everything*, I don't know.)

Thanks,

-- Gary Oberbrunner

Mark Harris

Dec 13, 2009, 8:29:05 PM
to cu...@googlegroups.com
Hi Gary,

Glad you got it working to your satisfaction.  RadixSort is definitely linked in whether you need it or not, along with all the other algorithms.

Unfortunately once we add support for Fermi code, the code size will only grow, so we need to address these problems.  These are all things we can hopefully investigate, perhaps for the release after next. Another option is to divide CUDPP into separate algorithm libraries.  

Note that there is another library, Thrust, that is a template library, and hence only the code you use is compiled. However, Thrust is designed for development efficiency and generality, not runtime efficiency, so its algorithm implementations are often not as fast as CUDPP's.

Mark
