utest crashes on cudaMalloc


patfla

Sep 11, 2010, 9:43:11 PM
to opencurrent-users
Am building OpenCurrent for the first time.

CentOS 5.5 64 bits. CUDA SDK 3.1 64 bits. Nvidia Geforce GT 240M
with latest nvidia devel driver 256.40 (first time I ran one of the
OpenCurrent programs it told me the latest driver was required so I
downloaded and installed it).

I had problems with NetCDF. As I was first building, it complained of
various missing symbols. My quick fix was to back up from NetCDF v. 4
to version 3 by using

./configure --disable-netcdf-4

This built and I continued. OpenCurrent appears to build fine but
when I try my first program, utest, it crashes as follows:

[patfla@localhost tests]$ ./utest
[INFO] Running on GPU 0
Running tests: RayleighTimingTest, RayleighNoSlipTest, RayleighTest,
PCGDoubleTest, LockExTest, LockExDoubleTest, NSTest,
MultigridMixedTest, MultigridDoubleTest, ProjectDoubleTimingTest,
ProjectDoubleTest, ProjectTest, MultigridTest,
Advection3DDoubleSymmetryTest, Advection3DDoubleTest,
Advection3DDoubleSwirlTest, Advection3DSymmetryTest, Advection3DTest,
SamplingTest, Grid1DReduceTest, Grid1DReduceDoubleTest,
Reduce1DTimingTest, Reduce1DDoubleTimingTest, Grid3DTest,
Grid3DReduceTest, Reduce3DTimingTest, Grid3DReduceDoubleTest,
Reduce3DDoubleTimingTest, NetCDFTest, GridNetCDFTest, Diffusion3DTest,
Diffusion1DTest
running RayleighTimingTest on thread 0
[ERROR] Grid3DDeviceF::init - cudaMalloc failed
[ERROR] Eqn_IncompressibleNS3D::set_parameters - failed on
initializing v
[ASSERT] RayleighTimingTest::assert_true at /mnt/ddrive/Projects/
opencurrent-1.1.0/src/tests/rayleightiming.cpp line 136
[FAILED] RayleighTimingTest
running RayleighNoSlipTest on thread 0
run resolution 16
deltaT = 1674.000000
init min/max t = 52.312500 1621.687500
[ERROR] cudaMemcpy(HostToDevice) - CUDA error "invalid texture
reference"
[ERROR] Eqn_IncompressibleNS3D::set_parameters - failed copying to u
[ASSERT] RayleighNoSlipTest::assert_true at /mnt/ddrive/Projects/
opencurrent-1.1.0/src/tests/rayleighnoslip.cpp line 231
Segmentation fault
[patfla@localhost tests]$ emacs &
[1] 4792
[patfla@localhost tests]$


My first inclination is to think I shouldn't have backed up to NetCDF
3 but rather figured out how to get NetCDF v. 4 to work.

Then again, it could be something completely different.

I wonder if there are defines in the OpenCurrent source where I need
to specify the parameters for my particular card? (e.g. threads per
block) - and I'm looking for such.

Jonathan Cohen

Sep 11, 2010, 9:59:00 PM
to opencurr...@googlegroups.com
I would suggest using CMake's 'make test' command rather than running utest directly.  Among other things, it runs each test as a separate process, so if one test crashes, it won't prevent further tests from being run.

I agree that NetCDF seems to be a mess - if I had to do it over again, I would have built the IO system on top of HDF5 (in fact, that's a good project for someone...).  But you can disable it entirely for testing purposes by modifying the system.cmake file, setting OCU_NETCDF_ENABLED to FALSE.  See http://code.google.com/p/opencurrent/wiki/OpenCurrent for other build options.

I'm wondering if you're running out of memory?  Many of the default settings in the unit tests assume 4GB of memory (e.g. from a Tesla system), and some of them will fail on GPUs with less memory.  If you run 'make test', please post back which tests pass and which fail.

-Jon
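
A quick way to sanity-check the out-of-memory hypothesis is a back-of-the-envelope estimate. The sketch below is not OpenCurrent code: the function names, the 256^3 resolution, the ghost-cell padding, and the count of 10 grids are all assumptions chosen for illustration. It only multiplies the cell count by the element size and compares the total against the card's memory.

```python
def grid_bytes(nx, ny, nz, bytes_per_cell, ghost=0):
    """Bytes for one 3D grid; `ghost` is a hypothetical padding in cells per side."""
    return (nx + 2 * ghost) * (ny + 2 * ghost) * (nz + 2 * ghost) * bytes_per_cell


def fits_in_memory(n_grids, nx, ny, nz, bytes_per_cell, budget_bytes, ghost=0):
    """True if n_grids equally sized grids fit within budget_bytes."""
    return n_grids * grid_bytes(nx, ny, nz, bytes_per_cell, ghost) <= budget_bytes


GIB = 1024 ** 3

if __name__ == "__main__":
    # Hypothetical case: 10 grids at 256^3 on a card with 1 GiB of memory.
    for label, size in (("float", 4), ("double", 8)):
        ok = fits_in_memory(10, 256, 256, 256, size, 1 * GIB)
        print(label, "fits" if ok else "does not fit")
```

The usable budget is smaller than the nominal memory size, since the display and the CUDA context take a share as well, so a result close to the limit should be read as a likely failure.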

patfla

Sep 11, 2010, 10:12:35 PM
to opencurrent-users
On Sep 11, 6:59 pm, Jonathan Cohen <jco...@jcohen.name> wrote:
> i would suggest using CMake's 'make test' command rather than running utest
> directly.  Among other things, it runs all tests as a separate process, so
> if the program crashes, it won't prevent further tests from being run.
>
> I agree that NetCDF seems to be a mess - if I had to do it over again, I
> would have built the IO system on top of HDF5 (in fact, that's a good
> project for someone...)  But you can disable it entirely for testing
> purposes by modifying the system.cmake file, setting OCU_NETCDF_ENABLED to
> FALSE.  See http://code.google.com/p/opencurrent/wiki/OpenCurrent for other
> build options.
>
> I'm wondering if you're running out of memory?  Many of the default settings
> in the unit test assume 4GB of memory (e.g. from a Tesla system), and some
> of them will fail on GPUs with less memory.  If you run 'make test', please
> post back which tests pass and which fail
>
> -Jon
>
Thanx Jonathan,

4 GB! (on the video card). The Geforce GT 240M has 1 GB of DDR3
video memory. Any way of reducing the size of the computation?

I'll set OCU_NETCDF_ENABLED to FALSE; rebuild; run 'make test'; and
report back.

- pat

patfla

Sep 11, 2010, 10:31:27 PM
to opencurrent-users
My first guess is that the few that succeeded are the least memory
intensive?

[patfla@localhost src]$ make test
Running tests...
Start processing tests
Test project /mnt/ddrive/Projects/opencurrent-1.1.0/src
1/ 24 Testing Diffusion1DTest ***Failed
2/ 24 Testing Diffusion3DTest ***Failed
3/ 24 Testing Advection3DTest ***Failed
4/ 24 Testing Grid1DTest ***Failed
5/ 24 Testing Grid3DTest ***Failed
6/ 24 Testing MultigridTest ***Failed
7/ 24 Testing ProjectTest ***Failed
8/ 24 Testing Advection3DDoubleTest ***Failed
9/ 24 Testing Grid1DDoubleTest ***Failed
10/ 24 Testing Grid3DDoubleTest ***Failed
11/ 24 Testing MultigridDoubleTest ***Failed
12/ 24 Testing MultigridMixedTest ***Failed
13/ 24 Testing ProjectDoubleTest ***Failed
14/ 24 Testing SamplingTest ***Failed
15/ 24 Testing NSTest ***Exception: SegFault
16/ 24 Testing LockExTest ***Failed
17/ 24 Testing PCGTest ***Failed
18/ 24 Testing NetCDFTest Passed
19/ 24 Testing Diffusion1DMultiTest ***Failed
20/ 24 Testing Diffusion3DMultiTest ***Failed
21/ 24 Testing CoArray1DTest Passed
22/ 24 Testing CoArray3DTest ***Failed
23/ 24 Testing MultiReduceTest Passed
24/ 24 Testing MultigridMultiTest ***Failed

13% tests passed, 21 tests failed out of 24


On the other hand, is this the opportunity I've been waiting for to
run out (virtually) and buy a Tesla card?

Jonathan Cohen

Sep 11, 2010, 11:28:31 PM
to opencurr...@googlegroups.com
I think all the tests that are passing are the ones that don't call any CUDA routines.  So that makes me think you've got some CUDA version mismatch problems.  Are you able to compile and run the CUDA SDK examples, or other CUDA programs?
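
One way to make this kind of comparison quickly is to pull the per-test status out of the 'make test' summary. This is a minimal sketch, assuming the summary lines look like the ones posted in this thread (for example "18/ 24 Testing NetCDFTest Passed"). The function names are made up for illustration, and other CTest versions may format these lines differently.

```python
import re

# Matches lines such as " 15/ 24 Testing NSTest ***Exception: SegFault".
LINE_RE = re.compile(r"^\s*(\d+)/\s*(\d+)\s+Testing\s+(\S+)\s+(.+?)\s*$")


def parse_ctest_summary(text):
    """Map test name -> status string, with the leading '***' marker removed."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            results[m.group(3)] = m.group(4).lstrip("*")
    return results


def passed_tests(results):
    """Sorted names of the tests whose status is exactly 'Passed'."""
    return sorted(name for name, status in results.items() if status == "Passed")


if __name__ == "__main__":
    sample = """\
 1/ 24 Testing Diffusion1DTest ***Failed
15/ 24 Testing NSTest ***Exception: SegFault
18/ 24 Testing NetCDFTest Passed
"""
    print(passed_tests(parse_ctest_summary(sample)))
```

Listing the passing names side by side makes it easier to see whether they share a property, such as not touching the GPU at all.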

patfla

Sep 11, 2010, 11:35:02 PM
to opencurrent-users
On Sep 11, 8:28 pm, Jonathan Cohen <jco...@jcohen.name> wrote:
> i think all the tests that are passing are the ones that don't call any CUDA
> routines.  So that makes me think you've got some CUDA version mismatch
> problems.  Are you able to compile and run the CUDA sdk examples, or other
> CUDA programs?
>
Under Windows - yes. But this is Linux and I didn't test the SDK
sample programs.

Good idea - that's what I'll do.

patfla

Sep 12, 2010, 2:23:31 AM
to opencurrent-users
I got the SDK samples to run but 'make test' with OpenCurrent still
failed in the same manner.

After some fumbling around I found that I needed to build with sm_12 -
and not sm_13.

I read the description of sm_13; saw the stress it put on doubles; and
believed that a 240M would not have that kind of support and so backed
up to sm_12.

It may seem academic at this point, but where can I read more about the
distinctions between sm_10 through sm_13? And just what the heck are
these? Yes, they have something to do with the 'programming model'
and just what hardware is supported, but I'd like to find a better
answer than that.

[patfla@localhost sm12-rel]$ make test
Running tests...
Start processing tests
Test project /home/patfla/Projects/opencurrent-1.1.0/sm12-rel
1/ 17 Testing Diffusion1DTest Passed
2/ 17 Testing Diffusion3DTest Passed
3/ 17 Testing Advection3DTest Passed
4/ 17 Testing Grid1DTest Passed
5/ 17 Testing Grid3DTest Passed
6/ 17 Testing MultigridTest Passed
7/ 17 Testing ProjectTest Passed
8/ 17 Testing Advection3DDoubleTest Passed
9/ 17 Testing Grid1DDoubleTest Passed
10/ 17 Testing Grid3DDoubleTest Passed
11/ 17 Testing MultigridDoubleTest Passed
12/ 17 Testing MultigridMixedTest Passed
13/ 17 Testing ProjectDoubleTest Passed
14/ 17 Testing SamplingTest Passed
15/ 17 Testing NSTest Passed
16/ 17 Testing LockExTest Passed
17/ 17 Testing PCGTest Passed

100% tests passed, 0 tests failed out of 17
[patfla@localhost sm12-rel]$

And oh yes (said this before but) the 240M has only 1 GB of memory.
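
On the sm_10 through sm_13 question, as a rough orientation: the sm_XY names are nvcc target architectures, and the two digits are the major and minor "compute capability" of the hardware the code is built for. Double-precision arithmetic needs compute capability 1.3 or higher, which is why the description of sm_13 stresses doubles. The sketch below only encodes that reading. The helper names are hypothetical, and it handles just the two-digit names discussed here.

```python
import re


def parse_sm(arch):
    """Split a name like 'sm_13' into a (major, minor) compute capability."""
    m = re.fullmatch(r"sm_(\d)(\d)", arch)
    if m is None:
        raise ValueError("expected a name like 'sm_13', got %r" % (arch,))
    return int(m.group(1)), int(m.group(2))


def supports_double(arch):
    """Double precision is available from compute capability 1.3 upward."""
    return parse_sm(arch) >= (1, 3)


if __name__ == "__main__":
    for arch in ("sm_10", "sm_11", "sm_12", "sm_13"):
        kind = "double" if supports_double(arch) else "single only"
        print(arch, parse_sm(arch), kind)
```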

Jonathan Cohen

Sep 12, 2010, 2:36:38 PM
to opencurr...@googlegroups.com
Check out Appendix G in the programming guide:

http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf

Also, see Appendix A for a list of which GPU supports which sm architecture (aka "compute capability").

I'm a bit surprised that it would crash under 1.2, though, since all compute capabilities are supposed to be backwards compatible.  Code compiled for 1.2 should run on a 1.3-capable device.  That actually sounds like a possible driver bug to me.

-Jon
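
For reference, the compatibility rule in the CUDA programming guide can be written as a small check. This is only a sketch with hypothetical function names: a binary built for compute capability X.y is guaranteed to run on devices of capability X.z with z >= y, while PTX built for some capability can be JIT-compiled for any equal or higher capability. Neither path lets code built for 1.3 run on a 1.2 device.

```python
def binary_runs_on(code_cc, device_cc):
    """Binary (cubin) rule: same major version, device minor >= code minor."""
    return code_cc[0] == device_cc[0] and device_cc[1] >= code_cc[1]


def ptx_runs_on(code_cc, device_cc):
    """PTX rule: the device capability must be equal to or higher than the target."""
    return tuple(device_cc) >= tuple(code_cc)


if __name__ == "__main__":
    # Capabilities are (major, minor) pairs, e.g. sm_12 -> (1, 2).
    print("sm_12 code on a 1.3 device:", binary_runs_on((1, 2), (1, 3)))
    print("sm_13 code on a 1.2 device:", binary_runs_on((1, 3), (1, 2)))
```

So a build that passes with sm_12 and fails with sm_13 is what this rule predicts for a device whose compute capability is 1.2.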

patfla

Sep 12, 2010, 3:32:55 PM
to opencurrent-users
On Sep 12, 11:36 am, Jonathan Cohen <jco...@jcohen.name> wrote:
> Check out Appendix G in the programming guide:
>
> http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NV...
>
> Also, see appendix A for a list of which GPU supports which sm architecture
> (aka "compute capability")
>
> I'm a bit surprised that it would crash under 1.2, though, since all compute
> capabilities are supposed to be backwards compatible.  Code compiled for 1.2
> should run on a 1.3-capable device.  That actually sounds like a possible
> driver bug to me.
>
> -Jon
>
> >  16/ 17 Testing LockExTest                       Passed
> >  17/ 17 Testing PCGTest                          Passed
>
> > 100% tests passed, 0 tests failed out of 17
> > [patfla@localhost sm12-rel]$
>
> > And oh yes (said this before but) the 240M has only 1 GB of memory.

Thanks - I was already starting to look through the CUDA toolkit PDFs.

It crashed under 1.3 but runs successfully under 1.2, which is what you
meant of course.

I'm running the latest 256.40 driver from here:

http://developer.nvidia.com/object/cuda_3_1_downloads.html