Hi,
I was trying to train a model using Soumith's Torch7 ImageNet multi-GPU code. I have four Pascal Titan X cards, and the provided AlexNet runs fine on each of GPUs 1, 2, 3, and 4 individually.
Then I switched to my custom network. Now the code runs only if I specify -GPU 1 (i.e. GPU ID 0); with any other GPU I get the following error:
torch/install/share/lua/5.1/nn/THNN.lua:110: Assertion `THCudaTensor_checkGPU(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /tmp/luarocks_cunn-scm-1-5253/cunn/lib/THCUNN/ClassNLLCriterion.cu:125
The thing is, my custom network does not use any data parallelism, and I'm not even trying to do multi-GPU training. What is strange is that it runs fine on GPU ID 0 but fails on every other GPU (IDs 1, 2, and 3 all give the above error).
I've reinstalled nn, cudnn, cunn, and cutorch, to no avail. Does anyone have any idea what causes this error? Thanks!
Chen