Greetings,
I am trying to run multiple Caffe training sessions concurrently under MPS (CUDA Multi-Process Service). I found this thread on GitHub:
https://github.com/BVLC/caffe/issues/1427
So here I am.
Whenever I run Caffe with MPS, I get the following error:
math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
A single training session runs fine. If I then start a second one, the first session aborts with the error above and the second hangs completely.
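To reproduce this without juggling two terminal windows, a small launcher like the one below can start several training processes with a delay between them and collect their exit codes. This is a sketch under assumptions: the helper name `launch_concurrently` is hypothetical, and the `caffe train --solver=...` command line and solver file name in the usage note are illustrative placeholders for however you normally start training. A process that is still running after `timeout_s` seconds is treated as hung, killed, and reported as `None`.

```python
import subprocess
import time


def launch_concurrently(commands, stagger_s=0.0, timeout_s=None):
    """Start each command as a separate process, sleeping stagger_s seconds
    between launches, and return their exit codes in launch order.

    A negative code -N means the process was killed by signal N (on Linux,
    -6 is SIGABRT). None means the process was still running after
    timeout_s seconds and was killed.
    """
    procs = []
    for i, cmd in enumerate(commands):
        if i > 0:
            time.sleep(stagger_s)  # let the earlier session get going first
        procs.append(subprocess.Popen(cmd))

    codes = []
    for p in procs:
        try:
            # Wait for each process in launch order, up to timeout_s each.
            codes.append(p.wait(timeout=timeout_s))
        except subprocess.TimeoutExpired:
            p.kill()  # treat a process that never finishes as hung
            p.wait()  # reap it so it does not linger as a zombie
            codes.append(None)
    return codes
```

For example, with `cmd = ["caffe", "train", "--solver=solver.prototxt"]` (hypothetical solver path), `launch_concurrently([cmd, cmd], stagger_s=60, timeout_s=600)` would mirror the two-window setup. On Linux, a session that dies with "Aborted" as in the log below should show up as -6, and a hung one as `None`.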
Output from the first session:
I1210 12:20:50.292115 6876 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:20:50.292163 6876 solver.cpp:247] Iteration 0, Testing net (#0)
I1210 12:20:55.199916 6876 solver.cpp:298] Test net output #0: loss = 0.700242 (* 1 = 0.700242 loss)
I1210 12:20:55.409710 6876 solver.cpp:191] Iteration 0, loss = 0.700957
I1210 12:20:55.409735 6876 solver.cpp:206] Train net output #0: loss = 0.700957 (* 1 = 0.700957 loss)
I1210 12:20:55.409750 6876 solver.cpp:403] Iteration 0, lr = 0.1
I1210 12:21:15.858656 6876 solver.cpp:191] Iteration 100, loss = 0.0993546
I1210 12:21:15.858696 6876 solver.cpp:206] Train net output #0: loss = 0.0993546 (* 1 = 0.0993546 loss)
I1210 12:21:15.858708 6876 solver.cpp:403] Iteration 100, lr = 0.1
I1210 12:21:36.314012 6876 solver.cpp:191] Iteration 200, loss = 0.0974024
I1210 12:21:36.314079 6876 solver.cpp:206] Train net output #0: loss = 0.0974024 (* 1 = 0.0974024 loss)
I1210 12:21:36.314090 6876 solver.cpp:403] Iteration 200, lr = 0.1
F1210 12:21:54.647020 6876 math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f1ab2494daa (unknown)
@ 0x7f1ab2494ce4 (unknown)
@ 0x7f1ab24946e6 (unknown)
@ 0x7f1ab2497687 (unknown)
@ 0x49aa05 caffe::caffe_copy<>()
@ 0x4c164d caffe::HDF5DataLayer<>::Forward_gpu()
@ 0x464e5b caffe::Net<>::ForwardFromTo()
@ 0x465287 caffe::Net<>::ForwardPrefilled()
@ 0x45b8b9 caffe::Solver<>::Solve()
@ 0x416072 train()
@ 0x410941 main
@ 0x7f1aad409ec5 (unknown)
@ 0x414aa7 (unknown)
@ (nil) (unknown)
Aborted
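The fatal line uses glog's prefix (severity letter, MMDD date, time, thread id, source file:line). The check presumably comes from Caffe's CUDA_CHECK macro, which compares a CUDA runtime return value against cudaSuccess, so "(4 vs. 0)" means the call returned error code 4. The stack trace shows it was raised in caffe::caffe_copy<>() called from HDF5DataLayer<>::Forward_gpu(); because kernel launches are asynchronous, the launch failure may have come from an earlier kernel and only surfaced at this copy. The sketch below decodes such lines. The helper name `decode_check_failure` is hypothetical, and the code-to-name table assumes the legacy CUDA numbering (CUDA 10.0 and earlier); newer toolkits renumbered some of these codes.

```python
import re

# Legacy CUDA runtime error codes (CUDA 10.0 and earlier); later toolkits
# renumbered some of them. Only the first few are listed.
LEGACY_CUDA_ERRORS = {
    0: "cudaSuccess",
    1: "cudaErrorMissingConfiguration",
    2: "cudaErrorMemoryAllocation",
    3: "cudaErrorInitializationError",
    4: "cudaErrorLaunchFailure",
}

# glog prefix: severity letter, MMDD, time, thread id, "file:line]".
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+)\s+"
    r"(?P<src>[^\]]+)\] (?P<msg>.*)$"
)
# CHECK_EQ failures print the two compared values as "(got vs. want)".
CHECK_CODES = re.compile(r"Check failed: .*\((?P<got>-?\d+) vs\. (?P<want>-?\d+)\)")


def decode_check_failure(line):
    """Return details of a fatal glog CHECK failure, or None for other lines."""
    m = GLOG_LINE.match(line)
    if not m or m.group("sev") != "F":
        return None
    c = CHECK_CODES.search(m.group("msg"))
    if not c:
        return None
    got = int(c.group("got"))
    return {
        "thread": int(m.group("tid")),
        "source": m.group("src"),
        "code": got,
        "name": LEGACY_CUDA_ERRORS.get(got, "unknown"),
    }
```

Applied to the fatal line above, this yields code 4 (cudaErrorLaunchFailure) raised at math_functions.cpp:91 in thread 6876; informational lines such as the solver.cpp messages return None.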
Output from the second session, which hangs after these lines:
I1210 12:21:49.591974 6896 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:21:49.592030 6896 solver.cpp:247] Iteration 0, Testing net (#0)
I followed these directions to set up MPS:
http://cudamusing.blogspot.fr/2013/07/enabling-cuda-multi-process-service-mps.html
This setup works fine with the NVIDIA sample programs, so I believe MPS itself is configured correctly.
Has anyone gotten this to work? If not, does anyone know what needs to change in the Caffe source for it to work correctly?
Thanks,
Scott