THCudaTensor_checkGPU Assertion failed


Chen-Ping Yu

Sep 26, 2016, 3:54:48 PM9/26/16
to torch7
Hi,

I was trying to train a model using Soumith's torch7 ImageNet multi-GPU code. I have 4 Pascal Titan X cards, and it runs fine with the provided AlexNet on each of my GPUs 1, 2, 3, and 4 individually.

Then I switched to my custom network. The code runs only if I specify -GPU 1 (using GPU ID 0); if I use any other GPU, I get the following error:

torch/install/share/lua/5.1/nn/THNN.lua:110: Assertion `THCudaTensor_checkGPU(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/luarocks_cunn-scm-1-5253/cunn/lib/THCUNN/ClassNLLCriterion.cu:125

The thing is, my custom network does not use any data parallelism, and I'm not even trying to do multi-GPU training. The fact that it runs fine on GPU ID 0 but fails on every other GPU (IDs 1, 2, and 3 all give the above error) is strange.

I've re-installed nn, cudnn, cunn, and cutorch to no avail. Does anyone have any idea about this error? Thanks!


Chen

fchouteau

Sep 27, 2016, 5:08:50 AM9/27/16
to torch7
Are you sure you are not assigning some of your gradients/weights to GPU 0 before calling cutorch.setDevice(i)?
Check that idx = cutorch.getDevice() returns the selected GPU every time, everywhere.

My hypothesis is that you are allocating something on GPU 0 before choosing your GPU; that would explain why everything works on GPU 0 but not on the others.
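A minimal sketch of that ordering, assuming the usual cutorch/cunn setup (the layer choice and the gpuid value are illustrative, not from the original code):

```lua
-- Sketch: select the GPU *before* allocating any CUDA tensors or calling
-- :cuda() on the model/criterion, so every tensor lands on the same device.
require 'cutorch'
require 'cunn'

local gpuid = 2               -- assumption: the ID you would pass as -GPU 2
cutorch.setDevice(gpuid)      -- must come before any :cuda() call below

local model = nn.Linear(10, 2):cuda()           -- allocated on GPU `gpuid`
local criterion = nn.ClassNLLCriterion():cuda() -- same device as the model

-- sanity check: confirm nothing switched the device in between
assert(cutorch.getDevice() == gpuid)
```

If any :cuda() call (or a default tensor allocation inside the model constructor) runs before cutorch.setDevice, those tensors stay on GPU 0, which produces exactly the checkGPU mismatch above. (This sketch requires a machine with at least two CUDA devices.)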

Chen-Ping Yu

Sep 28, 2016, 11:12:28 AM9/28/16
to torch7
Hi, thanks for the reply.

Something is strange with my custom network's implementation. I tried another very similar custom network and it works fine on different individual GPUs, so I'll dig into my first custom network's model implementation to see what's causing this issue... thanks!