1. If you are running multi-GPU training and modifying `cudnn_conv_layer.cpp` does not help, first test whether training works on a single GPU. If a single GPU works but multiple GPUs do not, the problem may be the combination of NCCL and CUDA versions.
2. Try the official multi-GPU Caffe example, such as LeNet on MNIST. If LeNet on MNIST works, shrink the layers of your own network as much as possible so that each layer has few parameters; this checks whether the problem is related to the size of the data NCCL has to communicate.
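As a sketch, the multi-GPU MNIST example can be run like this (paths assume a standard Caffe source checkout that has been built, and GPU IDs 0 and 1; adjust both to your setup):

```shell
# Prepare the MNIST LMDB data (downloads the dataset first)
./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh

# Train LeNet on two GPUs; Caffe uses NCCL for the gradient exchange
./build/tools/caffe train \
    --solver=examples/mnist/lenet_solver.prototxt \
    --gpu=0,1
```

If this example crashes the same way as your own network, the problem is in the NCCL/CUDA setup rather than in your network definition.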
3. You can then use the nccl-tests tool to run communication tests across your GPUs. For NCCL 2.x, use `https://github.com/NVIDIA/nccl-tests`; it is very useful for testing the data paths of GPU-to-GPU communication. For NCCL 1.x, the equivalent tests are integrated into NCCL itself.
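Building the tool is straightforward; this is a sketch assuming CUDA and NCCL are installed in their default locations (pass `CUDA_HOME` / `NCCL_HOME` to `make` otherwise):

```shell
# Fetch and build nccl-tests (NCCL 2.x)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make

# Quick sanity check: all-reduce across 2 GPUs
./build/all_reduce_perf -g 2
```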
4. nccl-tests measures the communication between multiple cards; you can set the number of GPUs and the maximum message size of the communication via command-line flags (see the README.md of nccl-tests). Finally, when you run nccl-tests, if NCCL is not compatible with your CUDA version, a `misaligned address` error will be reported once the message size exceeds a certain threshold. At that point you can consider changing the CUDA and NCCL versions. One combination that worked for me is CUDA 10.1 + NCCL 2.3.5-5 + caffe_windows.
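To find the threshold at which the error appears, sweep the message size with the flags from the nccl-tests README (`-b` minimum bytes, `-e` maximum bytes, `-f` size multiplier between steps, `-g` number of GPUs); the GPU count here is an example:

```shell
# Sweep all-reduce message sizes from 8 B to 256 MB, doubling each step, on 4 GPUs.
# With an incompatible NCCL/CUDA pair, the run aborts with "misaligned address"
# once the size passes the problematic threshold.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
```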
Of course, these are the results of my tests on Ubuntu 18.04. I will be honored if this helps with your problem!