Running Caffe using Nvidia's MPS


Scott Lanahan

Dec 10, 2014, 12:35:15
To: caffe...@googlegroups.com
Greetings,

I am trying to run multiple concurrent training sessions with Caffe under NVIDIA's MPS (Multi-Process Service). I found this thread on GitHub: https://github.com/BVLC/caffe/issues/1427

So here I am.

When trying to run Caffe with MPS, I consistently receive the following error:
    math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure

If I launch a single training session it works just fine; if I then start a second one, the first session aborts with the error above and the second hangs completely.

The first window:

I1210 12:20:50.292115  6876 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:20:50.292163  6876 solver.cpp:247] Iteration 0, Testing net (#0)
I1210 12:20:55.199916  6876 solver.cpp:298]     Test net output #0: loss = 0.700242 (* 1 = 0.700242 loss)
I1210 12:20:55.409710  6876 solver.cpp:191] Iteration 0, loss = 0.700957
I1210 12:20:55.409735  6876 solver.cpp:206]     Train net output #0: loss = 0.700957 (* 1 = 0.700957 loss)
I1210 12:20:55.409750  6876 solver.cpp:403] Iteration 0, lr = 0.1
I1210 12:21:15.858656  6876 solver.cpp:191] Iteration 100, loss = 0.0993546
I1210 12:21:15.858696  6876 solver.cpp:206]     Train net output #0: loss = 0.0993546 (* 1 = 0.0993546 loss)
I1210 12:21:15.858708  6876 solver.cpp:403] Iteration 100, lr = 0.1
I1210 12:21:36.314012  6876 solver.cpp:191] Iteration 200, loss = 0.0974024
I1210 12:21:36.314079  6876 solver.cpp:206]     Train net output #0: loss = 0.0974024 (* 1 = 0.0974024 loss)
I1210 12:21:36.314090  6876 solver.cpp:403] Iteration 200, lr = 0.1
F1210 12:21:54.647020  6876 math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure
*** Check failure stack trace: ***
    @     0x7f1ab2494daa  (unknown)
    @     0x7f1ab2494ce4  (unknown)
    @     0x7f1ab24946e6  (unknown)
    @     0x7f1ab2497687  (unknown)
    @           0x49aa05  caffe::caffe_copy<>()
    @           0x4c164d  caffe::HDF5DataLayer<>::Forward_gpu()
    @           0x464e5b  caffe::Net<>::ForwardFromTo()
    @           0x465287  caffe::Net<>::ForwardPrefilled()
    @           0x45b8b9  caffe::Solver<>::Solve()
    @           0x416072  train()
    @           0x410941  main
    @     0x7f1aad409ec5  (unknown)
    @           0x414aa7  (unknown)
    @              (nil)  (unknown)
Aborted


The second:
I1210 12:21:49.591974  6896 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:21:49.592030  6896 solver.cpp:247] Iteration 0, Testing net (#0)


I followed the directions for setting up MPS here: http://cudamusing.blogspot.fr/2013/07/enabling-cuda-multi-process-service-mps.html
This setup works fine for the NVIDIA sample programs, so I believe MPS itself is operating correctly.

Has anyone gotten this to work? If not, does anyone know what needs to be changed in the Caffe source for it to work correctly?

Thanks,
Scott

Scott Lanahan

Dec 10, 2014, 17:01:47
To: caffe...@googlegroups.com
I'm going to outline what I'm looking at while trying to fix this, so bear with me. I should add that I'm completely new to CUDA, though I had some experience with OpenCL some years ago. If I make an error in logic, I would appreciate a correction.

The error I'm receiving is cudaErrorLaunchFailure. It comes from the CUDA_CHECK on the cudaMemcpy call. The documentation says this about the error:
"An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA."



Caffe makes heavy use of the cudaMemcpy function documented here: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html

The NVIDIA MPS documentation says this about memory isolation and the cudaMemcpy() API:
"MPS client processes allocate memory from different partitions of the same GPU virtual address space.

As a result:
* An out-of-range write in a CUDA Kernel can modify the CUDA-accessible memory state of another process, and will not trigger an error.
* An out-of-range read in a CUDA Kernel can access CUDA-accessible memory modified by another process, and will not trigger an error, leading to undefined behavior.
This behavior is constrained to memory accesses from pointers within CUDA Kernels. Any CUDA API restricts MPS clients from accessing any resources outside of that MPS Client's memory partition. For example, it is not possible to overwrite another MPS client's memory using the cudaMemcpy() API."

Due to the way this is worded, I'm hesitant to believe that the cudaMemcpy call itself is responsible for the error.
The call starts here: https://github.com/BVLC/caffe/blob/737ea5e936821b5c69f9c3952d72693ae5843370/src/caffe/layers/hdf5_data_layer.cu#L40-45
which leads here: https://github.com/BVLC/caffe/blob/737ea5e936821b5c69f9c3952d72693ae5843370/src/caffe/util/math_functions.cpp#L85-99
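
For reference, the GPU branch of caffe_copy() at that second link boils down to roughly the following (a simplified paraphrase from memory, not the verbatim source). CUDA_CHECK is just a glog check against cudaSuccess, and math_functions.cpp:91 in my error message is that check firing:

#include <cuda_runtime.h>
#include <glog/logging.h>

// Paraphrase of Caffe's CUDA_CHECK: turn any non-cudaSuccess return code
// into the fatal glog check seen in the log above.
#define CUDA_CHECK(condition)                                         \
  do {                                                                \
    cudaError_t error = (condition);                                  \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)

// Simplified sketch of the GPU path of caffe::caffe_copy(): a plain copy
// on the device, with cudaMemcpyDefault letting the runtime infer the
// direction from the (unified virtual addressing) pointers.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  if (X != Y) {
    CUDA_CHECK(cudaMemcpy(Y, X, sizeof(Dtype) * N, cudaMemcpyDefault));
  }
}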

I believe the error may be arising from the CUDA initialization in the second instance. I am looking into this next.
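
Before digging further into Caffe itself, my plan is to try to reproduce the problem outside Caffe: running several copies of a small stress test along these lines (just my own sketch, unrelated to Caffe's code) under MPS should show whether plain initialization, copies, and kernel launches already break once multiple clients are active:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial kernel so that each iteration exercises a launch as well as copies.
__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  std::vector<float> host(n, 1.0f);
  float* dev = nullptr;
  if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }
  // Loop long enough that several concurrent MPS clients overlap in time.
  for (int iter = 0; iter < 1000; ++iter) {
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n, 1.0001f);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
      std::fprintf(stderr, "iteration %d: %s\n", iter, cudaGetErrorString(err));
      cudaFree(dev);
      return 1;
    }
  }
  std::printf("completed without errors\n");
  cudaFree(dev);
  return 0;
}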

Scott Lanahan

Dec 11, 2014, 12:48:04
To: caffe...@googlegroups.com
After looking over the initialization routines, I saw nothing wrong. I ended up commenting out the CUDA_CHECK call, which surfaced an error in cuBLAS instead. That seemed odd, so I checked my driver: I had 340.29 installed instead of 340.69, which is the version I needed.

Updating the driver fixed it, and everything is currently running without any problems.
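
For anyone else who hits this: nvidia-smi shows the installed driver build, and a small program along these lines (my own sketch, using NVML for the driver string) prints the driver build together with the CUDA API versions the driver and runtime report:

#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>   // link with -lnvidia-ml

// Print the installed driver build (e.g. "340.29") plus the CUDA API
// versions reported by the driver and the runtime, to catch a mismatch
// without hunting through logs.
int main() {
  char driver[80] = "unknown";
  if (nvmlInit() == NVML_SUCCESS) {
    nvmlSystemGetDriverVersion(driver, sizeof(driver));
    nvmlShutdown();
  }
  int driver_api = 0, runtime_api = 0;
  cudaDriverGetVersion(&driver_api);
  cudaRuntimeGetVersion(&runtime_api);
  std::printf("NVIDIA driver build : %s\n", driver);
  std::printf("driver CUDA API     : %d.%d\n", driver_api / 1000, (driver_api % 100) / 10);
  std::printf("runtime CUDA API    : %d.%d\n", runtime_api / 1000, (runtime_api % 100) / 10);
  return 0;
}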

revspooner

Dec 12, 2014, 05:43:32
To: Scott Lanahan, caffe...@googlegroups.com
Hi Scott, what kind of performance improvements do you see using MPS?


Scott Lanahan

Dec 12, 2014, 10:55:41
To: caffe...@googlegroups.com, lana...@gmail.com
Hi Robert,

For me the performance increase is remarkable. I'm doing SGD on numerical data and was running into a problem with under-utilization of the GPU. Before using MPS I could only run one training session at a time, or I would face serious performance degradation as each process competed for the GPU. Currently I'm running eight simultaneous training processes with very little slowdown at all. The only caveat is that the NVIDIA MPS server itself consumes, on average, about 1.5 to 2.0 worth of CPU load.

I should add that I'm using a Quadro K5200, which (as far as I know) supports up to 12 MPS client processes, so I could probably push the number higher if I wanted to.

lvng...@umich.edu

Jun 10, 2016, 23:29:58
To: Caffe Users, lana...@gmail.com
Hi Scott,

I just found this post and am currently working on the same problem. Do you still have the code, by any chance? Even a minimal example would be much appreciated.

Thanks,

Marcelo Amaral

Apr 19, 2017, 08:19:53
To: Caffe Users, lana...@gmail.com

Hi all,
Have you managed to run Caffe with MPS?
When I run two instances of Caffe under MPS, one of them fails with the error:

F0419 07:03:26.891240 21095 syncedmem.hpp:18] Check failed: error == cudaSuccess (14 vs. 0)  mapping of buffer object failed
*** Check failure stack trace: ***
    @     0x3fffb6b9ccfc  google::LogMessage::Fail()
    @     0x3fffb6b9f88c  google::LogMessage::SendToLog()
    @     0x3fffb6b9c6ec  google::LogMessage::Flush()
    @     0x3fffb6ba0464  google::LogMessageFatal::~LogMessageFatal()
    @     0x3fffb6fd629c  caffe::SyncedMemory::mutable_cpu_data()
    @     0x3fffb6e47938  caffe::Blob<>::mutable_cpu_data()
    @     0x3fffb6e997a0  caffe::GaussianFiller<>::Fill()
    @     0x3fffb6f24604  caffe::InnerProductLayer<>::LayerSetUp()
    @     0x3fffb6fa49e8  caffe::Net<>::Init()
    @     0x3fffb6fa6538  caffe::Net<>::Net()
    @     0x3fffb6fb8e20  caffe::Solver<>::InitTrainNet()
    @     0x3fffb6fb94fc  caffe::Solver<>::Init()
    @     0x3fffb6fb98e0  caffe::Solver<>::Solver()
    @     0x3fffb6fd3fc4  caffe::Creator_SGDSolver<>()
    @         0x10010e54  caffe::SolverRegistry<>::CreateSolver()
    @     0x3fffb6fb1548  caffe::P2PSync<>::P2PSync()
    @     0x3fffb6fb2dd0  caffe::P2PSync<>::Run()
    @         0x1000a624  train()
    @         0x100078a0  main
    @     0x3fffb5fa4700  generic_start_main.isra.0
    @     0x3fffb5fa48f4  __libc_start_main
    @              (nil)  (unknown)