Newbie question: getting the memorychecker to work


Jose Juan

Jun 7, 2010, 3:21:50 PM
to gpuocelot
I don't know if this is the correct medium for asking support-related
questions or if this is a development mailing list; however, this is the
only related place I could find...

When I try to get the example described at
http://code.google.com/p/gpuocelot/wiki/MemoryChecker to work, I get

terminate called after throwing an instance of 'hydrazine::Exception'
what(): In function - cudaMemcpy - invalid memory access at
0x9044bd0(512 bytes)
Aborted

for input parameter 0 and

terminate called after throwing an instance of 'hydrazine::Exception'
what(): [PC 8] [thread 0] [cta 0] st.global.s32 [%r7 + 0], %r4 -
Global memory access 0x92e3bd0 is not within any allocated or mapped
range.

Nearby Device Allocations
[0x92e4f10] - [0x92e5110] (512 bytes)

Near src/cuda.cu:10:0

Aborted

for input parameter 1, instead of the more verbose output indicated.

On a possibly related note, even after correcting the error in the
example and launching the kernel with init<<< 4, 32 >>>( deviceA ); I
get

terminate called after throwing an instance of
'boost::archive::archive_exception'
what(): stream error
Aborted

Gregory Frederick Diamos

Jun 7, 2010, 3:50:08 PM
to gpuocelot


On Jun 7, 12:21 pm, Jose Juan <jjta...@gmail.com> wrote:
> I don't know if this is the correct medium for asking support related
> questions or if this is a development mailist, however this is the
> only related place I could find...

It is a bit of both as most of the developers here end up doing
support as well :)

>
> After I try to get the example described in (http://code.google.com/p/
> gpuocelot/wiki/MemoryChecker) to work
>
> I get
>
> terminate called after throwing an instance of 'hydrazine::Exception'
>   what():  In function - cudaMemcpy - invalid memory access at
> 0x9044bd0(512 bytes)
> Aborted
>
> for input parameter 0 and
>
> terminate called after throwing an instance of 'hydrazine::Exception'
>   what():  [PC 8] [thread 0] [cta 0] st.global.s32 [%r7 + 0], %r4 -
> Global memory access 0x92e3bd0 is not within any allocated or mapped
> range.
>
> Nearby Device Allocations
> [0x92e4f10] - [0x92e5110] (512 bytes)
>
> Near src/cuda.cu:10:0
>
> Aborted
>
> for input parameter 1, instead of the more verbose output indicated.

What do you mean by more verbose output? I know that previous
versions of ocelot have printed out more information (type of
allocation, dimensions, etc), but this changed when the memory checker
was extracted from the core and included as a plugin. We are still
trying to find a good balance between concise and descriptive error
messages. If you have any suggestions I would be glad to hear them.

>
> On a maybe related problem, even after correcting the error in the
> example and launching the kernel with  init<<< 4, 32 >>>( deviceA ); I
> get
>
> terminate called after throwing an instance of
> 'boost::archive::archive_exception'
>   what():  stream error
> Aborted

This is because you have another trace generator enabled (one that
wants to emit a trace), but it cannot create the file for some reason
(most trace generators expect there to be a traces directory in the
current working directory). Error handling could really improve for
the trace generators....
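Based on that explanation, a minimal workaround sketch (the directory name "traces" comes from Greg's note above; adjust it if your trace generator is configured with a different output path):

```shell
# Most trace generators expect a "traces" subdirectory in the current
# working directory; when it is missing, opening the output archive
# fails with the boost::archive "stream error" shown above.
# Creating the directory before running the application avoids this:
mkdir -p traces
ls -d traces   # prints "traces" once the directory exists
```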

Regards,

Greg

Jose Juan

Jun 7, 2010, 4:01:54 PM
to gpuocelot
Thanks for the reply Greg :)

Actually, I think I have a more fundamental problem here. Reading more
messages on the mailing list, I found that before testing anything
Ocelot-related it is a good idea to get the regression test suite
running first. When I try, it successfully compiles and links, but
every test fails with an 'Aborted' message. I'm not sure what the
problem is. I'm using the ocelot and hydrazine versions from the svn
repository, as indicated by the wiki. (The failing test output just
indicates 'did not complete' for every test in 2.2 and 2.3.)

Diamos, Gregory F

Jun 7, 2010, 4:15:39 PM
to gpuo...@googlegroups.com
I would suspect that it is the same boost::archive::archive_exception that you reported before. Could you attach your configure.ocelot file from the directory where the tests are failing?

Regards,

Greg


Jose Juan

Jun 7, 2010, 5:51:59 PM
to gpuocelot
Sure, here it is

{
    ocelot: "ocelot-refactored",
    version: "1.0.65",
    trace: {
        enabled: false,
        database: "traces/database.trace",
        memory: false,
        branch: false,
        sharedComputation: false,
        instruction: false,
        parallelism: false,
        cacheSimulator: false,
        memoryChecker: true,
        raceDetector: false,
        warpSynchronous: {
            enabled: true,
            emitHotPaths: true
        }
    },
    cuda: {
        implementation: "CudaRuntime",
        runtimeApiTrace: "trace/CudaAPI.trace"
    },
    executive: {
        devices: [ llvm ],
        optimizationLevel: basic,
        workerThreadLimit: 1
    }
}

This is the configure.ocelot in cuda2.3. Strangely, after realizing
that the cuda2.2 folder didn't have a configure.ocelot, I tried copying
over the one from cuda2.3 and re-running the tests. After that, 60% of
the tests started running successfully; however, I also get some
erratic behavior: in about two out of every three runs, one test will
segfault and freeze my computer for about half a minute. The list of
tests not completing seems to be constant across several runs:

Failing tests:
(0.140s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/ImageDenoising : Did not complete.
(1.298s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/OceanFFT : Did not complete.
(1.718s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/QuasirandomGenerator : Did not complete.
(14.227s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/SimpleCUBLAS : Did not complete.
(0.634s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/MonteCarlo : Did not complete.
(0.959s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/BlackScholes : Did not complete.
(0.155s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/SimpleTexture : Did not complete.
(0.303s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/SimpleGL : Did not complete.
(1.509s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/FluidsGL : Did not complete.
(1.064s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/ConvolutionFFT2D : Did not complete.
(1.199s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/SimpleCUFFT : Did not complete.
(0.169s) : /home/jjtapia/Downloads/gpuocelot/tests/cuda2.2/MonteCarloMultiGPU : Did not complete.

The rest are marked as passed. Given the list, it would appear to be
some kind of graphics-related problem; however, the volumeRender
example does pass (albeit taking a long time to complete, 32 secs,
when the second longest to pass is MersenneTwister at 10 secs), along
with others such as postProcessGL and SobelFilter.


Diamos, Gregory F

Jun 7, 2010, 6:55:22 PM
to gpuo...@googlegroups.com
Thanks. Okay, so it seems a bug crept into the LLVM translator due to the handling of Float/Int instructions, which changed in a recent version of LLVM (it was a bug in Ocelot, but only exposed by newer versions of LLVM).

There should be a fix in r581. All systems with version 3.0 of the toolkit installed should expect the following failures:

(1.793s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/FluidsGL : Did not complete.
(16.642s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/SimpleCUBLAS : Did not complete.
(1.467s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/OceanFFT : Did not complete.
(1.613s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/ConvolutionFFT2D : Did not complete.
(1.315s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/SimpleCUFFT : Did not complete.

All systems without an OpenGL 2.0 compliant graphics driver (doesn't have to be nvidia) should expect the additional failures:

(0.332s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/PostProcessGL : Did not complete.
(0.198s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/SimpleGL : Did not complete.
(0.302s) : /home/normal/checkout/gpuocelot/tests/cuda2.2/VolumeRender : Did not complete.

Everything else in cuda2.2 should pass on the emulated, llvm, and nvidia (Tesla not Fermi) devices.

Jose Juan

Jun 7, 2010, 7:18:08 PM
to gpuocelot
Indeed, I get exactly those when I change configure.ocelot to
emulated. I guess I'll wait for the fix before trying to get ocelot
running again. Thanks for your time. :)


Diamos, Gregory F

Jun 7, 2010, 7:23:49 PM
to gpuo...@googlegroups.com
Do you need any specific functionality? There is no way to get the OpenGL applications to work on a system without OpenGL. For the applications that fail on CUBLAS or CUFFT, you need to use a PTX1.4 version of those libraries (toolkit version 2.3). Support for PTX 2.x will be a long time coming.

Jose Juan

Jun 7, 2010, 8:25:48 PM
to gpuocelot
OK... what I was trying to do is set up emulator mode for my
application. As mentioned in my first post, I was trying to set up a
toy project to test Ocelot beforehand. After you told me the problem
was with the trace generator, I disabled it in the configure.ocelot
file, so my program now runs without problems. However, when I compile
it with debug information (passing the -g flag to nvcc) and run it in
emulated mode, it would seem that the debug information is lost? I'm
not able to step in anywhere with the debugger...

/usr/local/cuda/bin/nvcc -g -cuda src/cuda.cu
g++ cuda.cu.cpp `OcelotConfig -l` -o cuda

The commands I'm using...


Gregory Diamos

Jun 7, 2010, 9:10:10 PM
to gpuo...@googlegroups.com
Try:

/usr/local/cuda/bin/nvcc -g -cuda src/cuda.cu
g++ -g cuda.cu.cpp `OcelotConfig -l` -o cuda

Regards,

Greg

Jose Juan

Jun 7, 2010, 10:30:05 PM
to gpuocelot
... OK, now I feel like a complete idiot. For some completely
nonsensical reason I forgot g++ was a compiler and was thinking it was
just a linker. Haha, ah well. Thanks for the assistance. :D

Gregory Diamos

Jun 8, 2010, 4:23:42 AM
to gpuo...@googlegroups.com
No worries :)

Jose Juan

Jun 8, 2010, 11:56:33 AM
to gpuocelot
Thanks for all the help

After amending my silly mistake I can now debug the program with no
problems. However, I cannot step into the kernel proper. Even if I set
a breakpoint inside the kernel, it will just step over to the closing
brace. Here's my test case...

#include <cstring>

__global__ void init(int* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = tid;
}

__host__ void method()
{
    int* hostA = new int[128];
    memset(hostA, 0, sizeof(int) * 128);
    int* deviceA;
    cudaMalloc((void**) &deviceA, sizeof(int) * 128);
    cudaMemcpy(deviceA, hostA, sizeof(int) * 128, cudaMemcpyHostToDevice);
    init<<< 4, 32 >>>(deviceA);
    cudaFree(deviceA);
    delete[] hostA;
}

int main(int argc, char** argv)
{
    method();
    return 0;
}


This is true for both emulated mode and llvm. In llvm mode, the
debugger indicates that threads are being created, but it still just
steps over the kernel.

Gregory...@gatech.edu

Jun 8, 2010, 1:16:34 PM
to gpuo...@googlegroups.com
So you want to be able to actually step into the kernel (init in this case)?

This is currently not supported and would be extremely difficult to add support for. In short, we would have to write a GDB interface for the emulator, following something like this (http://www.embecosm.com/appnotes/ean3/html/index.html), for example. For LLVM, it may be a bit easier as LLVM has recently been updated to emit debugging information in the JIT, but no one has really looked into this in detail (it was released after the LLVM backend in Ocelot was finished).

Something that will be coming in the more short term will be an attachable kernel-debugger for Ocelot that can be attached to a running kernel and at the very least will allow you to step through instructions and inspect registers, memory, threads, etc.

What is available right now is the ability to get a full or selected instruction trace from a kernel as it is executing. You can do this by setting the LLVM debug mode to report in configure.ocelot, or by changing some of the preprocessor flags in ocelot/executive/implementation/CooperativeThreadArray.cpp.
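For what it's worth, that configure.ocelot change might look something like the sketch below. The "debug" key name is an assumption inferred from Greg's description (not verified against any Ocelot revision); the surrounding fields are taken from the config posted earlier in this thread:

```
{
    executive: {
        devices: [ llvm ],
        debug: report,          /* assumed key name for the LLVM debug mode */
        workerThreadLimit: 1
    }
}
```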

Regards,

Greg


Diamos, Gregory F

Jun 8, 2010, 1:51:44 PM
to gpuo...@googlegroups.com
Actually I was just taking a look through the recent LLVM sources and it seems like there is an interface for emitting debugging information in the JIT. I'll give it a try on my end and see if I can at least get a stack trace working.

Jose Juan

Jun 8, 2010, 2:01:23 PM
to gpuocelot
Thank you very much. Having such functionality would indeed be very
helpful. :)


Gregory...@gatech.edu

Jun 8, 2010, 2:36:09 PM
to gpuo...@googlegroups.com
Okay, so I was able to get a backtrace into the jitted function:

#0 0x00d8c029 in _Z_ocelotTranslated__Z4initPi ()
#1 0x0052448f in executive::LLVMExecutableKernel::Worker::launchCtaWithoutBarriers (this=0x8ebf440, function=0xd8c010 <_Z_ocelotTranslated__Z4initPi>, c=0x8ebcf5c) at ocelot/executive/implementation/LLVMExecutableKernel.cpp:1520
#2 0x00524261 in executive::LLVMExecutableKernel::Worker::launchKernelWithoutBarriers (this=0x8ebf440, f=0xd8c010 <_Z_ocelotTranslated__Z4initPi>, c=0x8ebcf5c, begin=1, end=4, step=2) at ocelot/executive/implementation/LLVMExecutableKernel.cpp:1453
#3 0x005240dc in executive::LLVMExecutableKernel::Worker::execute (this=0x8ebf440) at ocelot/executive/implementation/LLVMExecutableKernel.cpp:1417
#4 0x00429e85 in hydrazine::Thread::_launch (argument=0x8ebf440) at hydrazine/interface/Thread.cpp:261
#5 0x006d396e in start_thread (arg=0xb77e1b70) at pthread_create.c:300
#6 0x008f3a4e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

But the information such as the file name and the correspondence between instructions and line numbers is lost. The reason for this is that we do not emit debugging metadata when we do PTX to LLVM IR translation. This would be possible to do; there is a nice interface for adding debugging information in LLVM (http://llvm.org/docs/SourceLevelDebugging.html), and PTX includes basic information. We would still lose variable names/values, but we could still tell you which line you were on.

That being said, there would be a fair amount of work required to implement this (at least one solid week). Also, I am not sure how useful it would be to tell you which line you were on without the ability to inspect the values of variables. We could probably get that information back by reversing the DWARF information inserted when compiling with nvcc -G, but that would be a much more significant effort (someone could spend months working on that).

I'll add an issue to keep track of this one, but most of the developers on Ocelot are more interested in research projects and only spend time adding debugging features if they are low overhead. Getting a full-featured debugger up and running that could inspect variables and step through a kernel one instruction at a time would be extremely high overhead.

Finally, this might not even be the right way to go: Ocelot performs execution-model translation when generating CPU code, which results in many CUDA threads being mapped onto a single CPU thread. Even if we could generate debugging information for every variable in the original CUDA source, the control flow of the program would appear to jump around randomly between statements, similar to a regular C++ program compiled with full optimization enabled. I personally think that a new debugger that could attach to the emulator and inspect the state of different threads would be more useful.
