cifar10 fails when training on multiple Nvidia Tesla P100 GPUs


Michael Chen

Sep 26, 2017, 2:48:58 PM
to Caffe Users
Hi,

I've been looking into an issue where cifar10 fails to start training when running on multiple P100 GPU cards. A single P100 works fine, and one or more M40s work fine; the failure only occurs with multiple P100s.

Here is the corresponding github issue I've submitted: https://github.com/NVIDIA/caffe/issues/422

cifar10 gets stuck at this line: CUDA_CHECK(cudaStreamSynchronize(comm_stream_->get())); inside parallel.cpp

Has anyone else encountered this problem and found a workaround or a fix?

Thanks,

Michael

Michael Chen

Oct 17, 2017, 12:44:55 PM
to Caffe Users
Hi,

Just wanted to update everyone on the solution.

We engaged with Nvidia and found that it was related to an ACS (Access Control Services) BIOS setting that was not configured correctly because our BIOS was out of date. Once the ACS setting was corrected, the GPU cards were able to talk to each other over the P2P bus.
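For anyone hitting a similar hang, here is a rough diagnostic sketch (assuming a standard CUDA/Linux setup with the nvidia-smi and lspci tools installed) for checking whether P2P between GPUs is actually usable before Caffe tries to use it:

```shell
# Show the GPU-to-GPU topology matrix; links reported as PIX/PXB/NV# can
# do P2P, while "SYS" (or "PHB" on some systems) means traffic crosses
# the CPU/PCIe host bridge and P2P may be unavailable.
nvidia-smi topo -m

# Inspect PCIe ACS state on the bridges. If ACSCtl shows features like
# SrcValid+ enabled on a switch between the GPUs, peer-to-peer TLPs can
# be redirected to the root complex, which breaks or stalls GPU P2P.
sudo lspci -vvv | grep -i -A 2 'ACSCap'
```

If the topology and ACS state look right, the `simpleP2P` and `p2pBandwidthLatencyTest` programs from the CUDA samples are a quick way to confirm that peer access actually works outside of Caffe.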

Thanks,

Michael 