cifar10 fails when training on multiple Nvidia Tesla P100 GPUs


Michael Chen

Sep 26, 2017, 2:48:58 PM
to Caffe Users
Hi,

I've been looking into an issue where cifar10 fails to start training when running on multiple P100 GPU cards. A single P100 works fine, and one or more M40s work fine; the failure only occurs with multiple P100s.

Here is the corresponding github issue I've submitted: https://github.com/NVIDIA/caffe/issues/422

cifar10 gets stuck at this line: CUDA_CHECK(cudaStreamSynchronize(comm_stream_->get())); inside parallel.cpp

Has anyone else encountered this problem and found a workaround or a fix?

Thanks,

Michael

Michael Chen

Oct 17, 2017, 12:44:55 PM
to Caffe Users
Hi,

Just wanted to update everyone on the solution.

We engaged with Nvidia and found that it was related to an ACS (Access Control Services) BIOS setting that was not configured correctly because our BIOS was out of date. Once the ACS setting was corrected, the GPU cards were able to talk to each other over the P2P bus.
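For anyone hitting a similar hang, here is a rough diagnostic sketch (assuming a standard CUDA/Linux setup with the nvidia-smi and lspci tools installed) for checking whether P2P between GPUs is actually usable before Caffe tries to use it:

```shell
# Show the GPU-to-GPU topology matrix; links reported as PIX/PXB/NV# can
# do P2P, while "SYS" (or "PHB" on some systems) means traffic crosses
# the CPU/PCIe host bridge and P2P may be unavailable.
nvidia-smi topo -m

# Inspect PCIe ACS state on the bridges. If ACSCtl shows features like
# SrcValid+ enabled on a switch between the GPUs, peer-to-peer TLPs can
# be redirected to the root complex, which breaks or stalls GPU P2P.
sudo lspci -vvv | grep -i -A 2 'ACSCap'
```

If the topology and ACS state look right, the `simpleP2P` and `p2pBandwidthLatencyTest` programs from the CUDA samples are a quick way to confirm that peer access actually works outside of Caffe.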

Thanks,

Michael 