More information on this issue:
I am running on the 'emulated' device. The problem doesn't happen with other CUDA 4.1 benchmarks such as VectorAdd and Transpose, so it seems to be related to the CUBLAS library. With some of Ocelot's reporting messages turned on, I get:
(3.943287) CudaRuntime.cpp:593: Loading module (fatbin) - /home/buildmeister/build/rel/gpgpu/toolkit/r4.2/cublas/src/magma_fermi_zgemm.cu
(3.943316) CudaRuntime.cpp:736: Registered kernel - _Z24fermiZgemm_v3_kernel_refILb1ELb1ELb1ELb1ELi16ELi24ELi8ELi8ELi8ELb0EEviiiPK7double2iS2_iPS0_iS2_S2_ii in module '/home/buildmeister/build/rel/gpgpu/toolkit/r4.2/cublas/src/magma_fermi_zgemm.cu'
<a bunch of messages like the previous one>
(3.943854) CudaRuntime.cpp:736: Registered kernel - _Z24fermiZgemm_v3_kernel_valILb0ELb0ELb0ELb0ELi16ELi24ELi8ELi8ELi8ELb0EEviiiPK7double2iS2_iPS0_iS0_S0_ii in module '/home/buildmeister/build/rel/gpgpu/toolkit/r4.2/cublas/src/magma_fermi_zgemm.cu'
(3.943874) CudaRuntime.cpp:672: cudaRegisterTexture('cublasZgemmMagmaTexA, dim: 1, norm: 0, ext: 0
(3.943896) CudaRuntime.cpp:672: cudaRegisterTexture('cublasZgemmMagmaTexB, dim: 1, norm: 0, ext: 0
(3.943923) FatBinaryContext.cpp:60: Found new fat binary format!
(3.943940) FatBinaryContext.cpp:65: binary size is: 390152 bytes
(3.943954) FatBinaryContext.cpp:79: Assertion message: Binary contains no PTX.
MatrixMul: ocelot/cuda/implementation/FatBinaryContext.cpp:79: cuda::FatBinaryContext::FatBinaryContext(const void*): Assertion `entry->type & FATBIN_2_PTX' failed.
Aborted