NVIDIA Announces CUDA 4.0

Rod

Feb 28, 2011, 3:04:17 PM

to gpuocelot

I thought I'd share this information with the group:

http://www.anandtech.com/show/4198/nvidia-announces-cuda-40

Best,

Rod

Gregory Diamos

Feb 28, 2011, 4:06:42 PM

to gpuo...@googlegroups.com

Yeah we've seen this coming for a while. The addition of a new smarter
cudaMemcpy and the simpler context management will be welcome
additions. It should integrate well with the current implementation of
the Cuda Runtime as we are already doing the same context management.
The global address space will work out of the box with the emulator and
llvm devices since they sit in the host address space.

Rod, in terms of allocating pointers on the AMD devices, will this be
difficult to support? Is there a chance that a pointer on an AMD device
can alias a host pointer?

Also, with every new version of the runtime, there will be some changes
to the Runtime API that we need to spend some time supporting. I don't
foresee any major challenges here, but it will take some effort.

Regards,

Greg

Rod

Feb 28, 2011, 7:46:41 PM

to gpuocelot

Hi Greg,

On Feb 28, 4:06 pm, Gregory Diamos <gregory.dia...@gatech.edu> wrote:
> Yeah we've seen this coming for a while. The addition of a new smarter
> cudaMemcpy and the simpler context management will be welcome
> additions. It should integrate well with the current implementation of
> the Cuda Runtime as we are already doing the same context management.
> The global address space will work out of the box with the emulator and
> llvm devices since they sit in the host address space.
>
> Rod, in terms of allocating pointers on the AMD devices, will this be
> difficult to support? Is there a chance that a pointer on an AMD device
> can alias a host pointer?

Right now, the AMD backend does all of its own memory management and
uses an arbitrary address as the starting point for device pointers
(nonzero, since zero is reserved to mean NULL). If the range for device
pointers in the UVA is deterministic, then we could move the starting
point into that range. But it's hard for me to tell right now without
the details.
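
A minimal sketch of the scheme described above, not the actual AMD backend code: a bump allocator that hands out device addresses from a configurable nonzero base, plus a check for whether the device range could alias a host range. The names `DevicePointerRange`, `range_init`, `range_alloc`, and `range_overlaps_host` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-pointer range: addresses come from [base, limit).
   The base is nonzero so that 0 keeps its NULL meaning. */
typedef struct {
    uintptr_t base;   /* first valid device address */
    uintptr_t limit;  /* one past the last valid device address */
    uintptr_t next;   /* next free device address */
} DevicePointerRange;

/* Choose the starting point; with UVA this could be moved into a known range. */
static void range_init(DevicePointerRange *r, uintptr_t base, size_t size) {
    assert(base != 0);                   /* 0 is reserved for NULL */
    assert(size <= UINTPTR_MAX - base);  /* the range must not wrap around */
    r->base = base;
    r->limit = base + size;
    r->next = base;
}

/* Return a device address aligned to `align` (a power of two),
   or 0 if the range is exhausted. */
static uintptr_t range_alloc(DevicePointerRange *r, size_t bytes, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t start = (r->next + (align - 1)) & ~(uintptr_t)(align - 1);
    if (start < r->next || start > r->limit || bytes > r->limit - start) {
        return 0;
    }
    r->next = start + bytes;
    return start;
}

/* Nonzero if the device range overlaps the host range [host, host + len). */
static int range_overlaps_host(const DevicePointerRange *r,
                               uintptr_t host, size_t len) {
    return host < r->limit && r->base < host + len;
}
```

If the device range for UVA turns out to be fixed, only the `base` passed to `range_init` would need to change; the overlap check is one way to detect the aliasing case raised in the question.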

>
> Also, with every new version of the runtime, there will be some changes
> to the Runtime API that we need to spend some time supporting. I don't
> foresee any major challenges here, but it will take some effort.

Ok. I will try my best to keep up with the changes.

Ryuta Suzuki

Mar 16, 2011, 11:33:08 PM

to gpuo...@googlegroups.com

Hi,

Has anyone tried running a CUDA 4.0-compiled program on Ocelot?

My initial attempt ended up with the error:

(0.002641) CudaRuntime.cpp:555: Assertion message: binary contains no PTX

cusp-gmres: /home/ryuta/devel/ocelot/src/gpuocelot/ocelot/ocelot/cuda/implementation/CudaRuntime.cpp:555: virtual void** cuda::CudaRuntime::cudaRegisterFatBinary(void*): Assertion `binary->ptx != 0' failed.

Aborted

I was wondering if this is a known issue.

BTW, my machine is 32-bit.

Thanks,

Ryuta

Gregory Diamos

Mar 17, 2011, 4:51:56 AM

to gpuo...@googlegroups.com

Yeah, the fat binary format definitely changed in 4.0 and the headers in
cuda/include do not match any more. I'm looking into it.

Gregory Diamos

Mar 17, 2011, 8:09:17 AM

to gpuo...@googlegroups.com

The good news is that the PTX is still in there. The bad news is that
the new format seems to be completely undocumented. I was able to get a
few examples running by assuming that the new format is as follows:

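/* Assumed layout of the header at the start of the fat binary data. */
typedef struct __cudaFatCudaBinary2HeaderRec {
    char         unknown[20];  /* purpose not yet known */
    unsigned int offset;       /* offset of the embedded ELF file */
} __cudaFatCudaBinary2Header;

/* Assumed layout of the new fat binary record. */
typedef struct __cudaFatCudaBinaryRec2 {
    int                       magic;       /* identifies the format */
    int                       version;
    const unsigned long long* fatbinData;  /* fat binary data */
    char*                     f;           /* purpose not yet known */
} __cudaFatCudaBinary2;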

The offset in the header gives the start of an ELF file. Here's a dump
from a sample file:

normal@atom:~/temp/fatbin$ objdump -x fatbinstripped2.elf

fatbinstripped2.elf: file format elf64-little
fatbinstripped2.elf
architecture: UNKNOWN!, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000000000

Program Header:
    PHDR off    0x0000000000000684 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**2
         filesz 0x0000000000000070 memsz 0x0000000000000070 flags r-x
0x60000000 off    0x0000000000000466 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**2
         filesz 0x00000000000001a4 memsz 0x00000000000001a4 flags r-x a00

Sections:
Idx Name                      Size      VMA               LMA               File off  Algn
  0 .text.kernelEntry         00000130  0000000000000000  0000000000000000  00000466  2**2
                              CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .nv.constant0.kernelEntry 0000002c  0000000000000000  0000000000000000  00000596  2**2
                              CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .nv.info.kernelEntry      00000048  0000000000000000  0000000000000000  000005c2  2**0
                              CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .nv.info                  00000078  0000000000000000  0000000000000000  0000060a  2**0
                              CONTENTS, ALLOC, LOAD, READONLY, DATA
SYMBOL TABLE:
0000000000000000 l d *ABS* 0000000000000000 .shstrtab
0000000000000000 l d *ABS* 0000000000000000 .strtab
0000000000000000 l d *ABS* 0000000000000000 .symtab
0000000000000000 l d *UND* 0000000000000000
0000000000000000 l d *UND* 0000000000000000
0000000000000000 l d .text.kernelEntry 0000000000000130 .text.kernelEntry
0000000000000000 l d .nv.info.kernelEntry 0000000000000000 .nv.info.kernelEntry
0000000000000000 l d .nv.info 0000000000000000 .nv.info
0000000000000000 l d .nv.constant0.kernelEntry 0000000000000000 .nv.constant0.kernelEntry
0000000000000000 g F .text.kernelEntry 00000000000000f0 0x10 kernelEntry
00000000000000f0 g F .text.kernelEntry 0000000000000010 funcTriple
0000000000000100 g F .text.kernelEntry 0000000000000010 funcPentuple
0000000000000110 g F .text.kernelEntry 0000000000000010 funcQuadruple
0000000000000120 g F .text.kernelEntry 0000000000000010 funcDouble

So we definitely need an ELF reader to be able to load this. The good
news is, though, that this actually gives us a usable symbol table for
PTX modules. So we can be very aggressive in how we lazily load PTX if
we choose to take advantage of this.
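
As a rough illustration of the first step such a reader would take, the sketch below locates the ELF image inside a fat binary buffer and checks the standard ELF magic bytes (0x7f 'E' 'L' 'F'). It relies on the header layout guessed earlier in this message, and it assumes the offset is measured from the start of that header, which has not been confirmed. `FatBinary2Header` and `find_elf` are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout guessed in this thread; not an official NVIDIA definition. */
typedef struct {
    char         unknown[20];
    unsigned int offset;  /* assumed: bytes from header start to the ELF image */
} FatBinary2Header;

/* Return a pointer to the ELF image inside `data`, or NULL if the buffer is
   too small, the offset is out of range, or the ELF magic is missing. */
static const unsigned char *find_elf(const unsigned char *data, size_t size) {
    static const unsigned char elf_magic[4] = {0x7f, 'E', 'L', 'F'};
    FatBinary2Header header;

    if (data == NULL || size < sizeof header) {
        return NULL;
    }
    memcpy(&header, data, sizeof header);  /* copy to avoid unaligned reads */

    if (header.offset > size || size - header.offset < sizeof elf_magic) {
        return NULL;
    }
    const unsigned char *elf = data + header.offset;
    return memcmp(elf, elf_magic, sizeof elf_magic) == 0 ? elf : NULL;
}
```

A real loader would go on to parse the ELF section headers and symbol table from the returned pointer; this sketch stops at validating that an ELF image is where the header says it is.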

We also have a new version of PTX (2.3), which is more or less identical
to 2.2.

I'm currently working on the SCons branch so the fix went in there
first. It will make its way into the trunk after some more testing.

Any pointers to good BSD-licensed source code for an ELF reader would
be appreciated.

Regards,

Greg

Ryuta Suzuki

Mar 17, 2011, 9:19:01 AM

to gpuo...@googlegroups.com

Thanks for taking a look at it.

The only ELF reader I can think of is the one in LLDB:

http://llvm.org/viewvc/llvm-project/lldb/trunk/source/Plugins/ObjectFile/ELF/

It's not a stand-alone ELF reader, though.

Also, it's distributed under the LLVM license, a "BSD-style" license.

Regards,

Ryuta

Rod

Mar 17, 2011, 9:37:57 AM

to gpuocelot

Greg,

I haven't installed CUDA 4.0 on my machine but do you mean that the
new /usr/local/cuda/include/__cudaFatFormat.h file doesn't define the
__cudaFatCudaBinary format?

The Valgrind ELF reader may help here although it is GPLv2-licensed.

Gregory Diamos

Mar 17, 2011, 11:50:55 AM

to gpuo...@googlegroups.com

I think it is telling that they chose to implement it themselves rather
than use an existing library. I may just end up doing the same. Thanks
for the reference, though; it is always useful to have a reference
implementation to start with.

Greg

Gregory Diamos

Mar 17, 2011, 11:54:32 AM

to gpuo...@googlegroups.com

> do you mean that the
> new /usr/local/cuda/include/__cudaFatFormat.h file doesn't define the
> __cudaFatCudaBinary format?

Unfortunately that is exactly what I mean. It seems like there are now
multiple fat binary formats that are distinguished by a magic word. The
old format is the same, but the new format is generated by nvcc 4.0 by
default.
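
A minimal sketch of that kind of dispatch, assuming both formats begin with a 32-bit magic word. The constants `EXAMPLE_MAGIC_OLD` and `EXAMPLE_MAGIC_NEW` are made-up placeholders, not the real values, and `classify_fat_binary` is a hypothetical helper, not part of the CUDA headers or of Ocelot.

```c
#include <assert.h>
#include <string.h>

/* Placeholder magic words for illustration only; the real values would
   have to come from the CUDA headers or from inspecting actual binaries. */
enum {
    EXAMPLE_MAGIC_OLD = 0x1111,
    EXAMPLE_MAGIC_NEW = 0x2222
};

typedef enum {
    FAT_BINARY_OLD,     /* format used before CUDA 4.0 */
    FAT_BINARY_NEW,     /* format generated by nvcc 4.0 by default */
    FAT_BINARY_UNKNOWN
} FatBinaryKind;

/* Inspect the leading magic word of a fat binary record and report which
   format it appears to be. */
static FatBinaryKind classify_fat_binary(const void *record) {
    int magic;

    if (record == NULL) {
        return FAT_BINARY_UNKNOWN;
    }
    memcpy(&magic, record, sizeof magic);  /* read only the first word */

    switch (magic) {
    case EXAMPLE_MAGIC_OLD: return FAT_BINARY_OLD;
    case EXAMPLE_MAGIC_NEW: return FAT_BINARY_NEW;
    default:                return FAT_BINARY_UNKNOWN;
    }
}
```

Keeping the check in one place means the old parsing path stays untouched, and binaries with an unrecognized magic word can be rejected with a clear error instead of an assertion failure.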

Regards,

Greg

Gregory Diamos

Mar 19, 2011, 10:12:54 PM

to gpuo...@googlegroups.com

There is a patch in the SCons branch with support for the new fat binary
format. I am able to successfully run the basic regression tests with
this change. I'm going to work on integrating this branch into the trunk
next week.

Greg
