Running Caffe using Nvidia's MPS


Scott Lanahan

Dec 10, 2014, 12:35:15
To: caffe...@googlegroups.com
Greetings,

I am trying to run multiple concurrent training sessions with Caffe under NVIDIA's MPS (Multi-Process Service). I found this thread on GitHub: https://github.com/BVLC/caffe/issues/1427

So here I am.

When trying to run Caffe with MPS, I consistently receive the following error:
    math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure

If I launch a single training session it works just fine; if I then start a second one, the first session aborts with the error above and the second hangs completely.

The first window:

I1210 12:20:50.292115  6876 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:20:50.292163  6876 solver.cpp:247] Iteration 0, Testing net (#0)
I1210 12:20:55.199916  6876 solver.cpp:298]     Test net output #0: loss = 0.700242 (* 1 = 0.700242 loss)
I1210 12:20:55.409710  6876 solver.cpp:191] Iteration 0, loss = 0.700957
I1210 12:20:55.409735  6876 solver.cpp:206]     Train net output #0: loss = 0.700957 (* 1 = 0.700957 loss)
I1210 12:20:55.409750  6876 solver.cpp:403] Iteration 0, lr = 0.1
I1210 12:21:15.858656  6876 solver.cpp:191] Iteration 100, loss = 0.0993546
I1210 12:21:15.858696  6876 solver.cpp:206]     Train net output #0: loss = 0.0993546 (* 1 = 0.0993546 loss)
I1210 12:21:15.858708  6876 solver.cpp:403] Iteration 100, lr = 0.1
I1210 12:21:36.314012  6876 solver.cpp:191] Iteration 200, loss = 0.0974024
I1210 12:21:36.314079  6876 solver.cpp:206]     Train net output #0: loss = 0.0974024 (* 1 = 0.0974024 loss)
I1210 12:21:36.314090  6876 solver.cpp:403] Iteration 200, lr = 0.1
F1210 12:21:54.647020  6876 math_functions.cpp:91] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure
*** Check failure stack trace: ***
    @     0x7f1ab2494daa  (unknown)
    @     0x7f1ab2494ce4  (unknown)
    @     0x7f1ab24946e6  (unknown)
    @     0x7f1ab2497687  (unknown)
    @           0x49aa05  caffe::caffe_copy<>()
    @           0x4c164d  caffe::HDF5DataLayer<>::Forward_gpu()
    @           0x464e5b  caffe::Net<>::ForwardFromTo()
    @           0x465287  caffe::Net<>::ForwardPrefilled()
    @           0x45b8b9  caffe::Solver<>::Solve()
    @           0x416072  train()
    @           0x410941  main
    @     0x7f1aad409ec5  (unknown)
    @           0x414aa7  (unknown)
    @              (nil)  (unknown)
Aborted


The second:
I1210 12:21:49.591974  6896 solver.cpp:160] Solving LogisticRegressionNet
I1210 12:21:49.592030  6896 solver.cpp:247] Iteration 0, Testing net (#0)


I followed the directions for setting up MPS here: http://cudamusing.blogspot.fr/2013/07/enabling-cuda-multi-process-service-mps.html
This setup works fine for the NVIDIA sample programs, so I believe MPS itself is operating correctly.

Has anyone gotten this to work? If not, does anyone know what needs to be changed in the Caffe source for it to work correctly?

Thanks,
Scott

Scott Lanahan

Dec 10, 2014, 17:01:47
To: caffe...@googlegroups.com
I'm going to outline what I'm looking at while trying to fix this, so bear with me. I should add that I'm completely new to CUDA, though I had some experience with OpenCL some years ago. If I make an error in logic, I would appreciate a correction.

The error I'm receiving is cudaErrorLaunchFailure. It comes from the CUDA_CHECK on the cudaMemcpy call. The documentation says this about the error:
"An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA."



Caffe makes heavy use of the cudaMemcpy function documented here: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html

The NVIDIA MPS documentation says this about memory isolation and the cudaMemcpy() API:
"MPS client processes allocate memory from different partitions of the same GPU virtual address space.

As a result:
* An out-of-range write in a CUDA Kernel can modify the CUDA-accessible memory state of another process, and will not trigger an error.
* An out-of-range read in a CUDA Kernel can access CUDA-accessible memory modified by another process, and will not trigger an error, leading to undefined behavior.
This behavior is constrained to memory accesses from pointers within CUDA Kernels. Any CUDA API restricts MPS clients from accessing any resources outside of that MPS Client's memory partition. For example, it is not possible to overwrite another MPS client's memory using the cudaMemcpy() API."

Due to the way this is worded, I'm hesitant to believe that the cudaMemcpy call itself is responsible for the error.
The call starts here: https://github.com/BVLC/caffe/blob/737ea5e936821b5c69f9c3952d72693ae5843370/src/caffe/layers/hdf5_data_layer.cu#L40-45
which leads here: https://github.com/BVLC/caffe/blob/737ea5e936821b5c69f9c3952d72693ae5843370/src/caffe/util/math_functions.cpp#L85-99
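
For reference, the GPU branch of caffe_copy() at that second link boils down to roughly the following (a simplified paraphrase from memory, not the verbatim source). CUDA_CHECK is just a glog check against cudaSuccess, and math_functions.cpp:91 in my error message is that check firing:

#include <cuda_runtime.h>
#include <glog/logging.h>

// Paraphrase of Caffe's CUDA_CHECK: turn any non-cudaSuccess return code
// into the fatal glog check seen in the log above.
#define CUDA_CHECK(condition)                                         \
  do {                                                                \
    cudaError_t error = (condition);                                  \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)

// Simplified sketch of the GPU path of caffe::caffe_copy(): a plain copy
// on the device, with cudaMemcpyDefault letting the runtime infer the
// direction from the (unified virtual addressing) pointers.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  if (X != Y) {
    CUDA_CHECK(cudaMemcpy(Y, X, sizeof(Dtype) * N, cudaMemcpyDefault));
  }
}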

I believe the error may be arising from the CUDA initialization in the second instance. I am looking into this next.
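
Before digging further into Caffe itself, my plan is to try to reproduce the problem outside Caffe: running several copies of a small stress test along these lines (just my own sketch, unrelated to Caffe's code) under MPS should show whether plain initialization, copies, and kernel launches already break once multiple clients are active:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial kernel so that each iteration exercises a launch as well as copies.
__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  std::vector<float> host(n, 1.0f);
  float* dev = nullptr;
  if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }
  // Loop long enough that several concurrent MPS clients overlap in time.
  for (int iter = 0; iter < 1000; ++iter) {
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n, 1.0001f);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
      std::fprintf(stderr, "iteration %d: %s\n", iter, cudaGetErrorString(err));
      cudaFree(dev);
      return 1;
    }
  }
  std::printf("completed without errors\n");
  cudaFree(dev);
  return 0;
}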

Scott Lanahan

Dec 11, 2014, 12:48:04
To: caffe...@googlegroups.com
After looking over the initialization routines, I saw nothing wrong. I ended up commenting out the CUDA_CHECK call, which surfaced an error in cuBLAS instead. That seemed odd, so I checked my driver: I had 340.29 installed instead of 340.69, which is the version I needed.

Updating the driver fixed it, and everything is currently running without any problems.
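
For anyone else who hits this: nvidia-smi shows the installed driver build, and a small program along these lines (my own sketch, using NVML for the driver string) prints the driver build together with the CUDA API versions the driver and runtime report:

#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>   // link with -lnvidia-ml

// Print the installed driver build (e.g. "340.29") plus the CUDA API
// versions reported by the driver and the runtime, to catch a mismatch
// without hunting through logs.
int main() {
  char driver[80] = "unknown";
  if (nvmlInit() == NVML_SUCCESS) {
    nvmlSystemGetDriverVersion(driver, sizeof(driver));
    nvmlShutdown();
  }
  int driver_api = 0, runtime_api = 0;
  cudaDriverGetVersion(&driver_api);
  cudaRuntimeGetVersion(&runtime_api);
  std::printf("NVIDIA driver build : %s\n", driver);
  std::printf("driver CUDA API     : %d.%d\n", driver_api / 1000, (driver_api % 100) / 10);
  std::printf("runtime CUDA API    : %d.%d\n", runtime_api / 1000, (runtime_api % 100) / 10);
  return 0;
}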

revspooner

Dec 12, 2014, 05:43:32
To: Scott Lanahan, caffe...@googlegroups.com
Hi Scott, what kind of performance improvements do you see using MPS?


Scott Lanahan

Dec 12, 2014, 10:55:41
To: caffe...@googlegroups.com, lana...@gmail.com
Hi Robert,

For me the performance increase is remarkable. I'm doing SGD on numerical data and was running into a problem with under-utilization of the GPU. Before using MPS I could only run one training session at a time, or I would face serious performance degradation as each process competed for the GPU. Currently I'm running eight simultaneous training processes with very little slowdown at all. The only caveat is that the NVIDIA MPS server itself consumes, on average, about 1.5 to 2.0 worth of CPU load.

I should add that I'm using a Quadro K5200, which (as far as I know) supports up to 12 MPS client processes, so I could probably push the number higher if I wanted to.

lvng...@umich.edu

Jun 10, 2016, 23:29:58
To: Caffe Users, lana...@gmail.com
Hi Scott,

I just found this post and am currently working on the same problem. Do you still have the code, by any chance? Even a minimal example would be much appreciated.

Thanks,

Marcelo Amaral

Apr 19, 2017, 08:19:53
To: Caffe Users, lana...@gmail.com

Hi all,
Have you managed to run Caffe with MPS?
When I run two instances of Caffe under MPS, one of them fails with the error:

F0419 07:03:26.891240 21095 syncedmem.hpp:18] Check failed: error == cudaSuccess (14 vs. 0)  mapping of buffer object failed
*** Check failure stack trace: ***
    @     0x3fffb6b9ccfc  google::LogMessage::Fail()
    @     0x3fffb6b9f88c  google::LogMessage::SendToLog()
    @     0x3fffb6b9c6ec  google::LogMessage::Flush()
    @     0x3fffb6ba0464  google::LogMessageFatal::~LogMessageFatal()
    @     0x3fffb6fd629c  caffe::SyncedMemory::mutable_cpu_data()
    @     0x3fffb6e47938  caffe::Blob<>::mutable_cpu_data()
    @     0x3fffb6e997a0  caffe::GaussianFiller<>::Fill()
    @     0x3fffb6f24604  caffe::InnerProductLayer<>::LayerSetUp()
    @     0x3fffb6fa49e8  caffe::Net<>::Init()
    @     0x3fffb6fa6538  caffe::Net<>::Net()
    @     0x3fffb6fb8e20  caffe::Solver<>::InitTrainNet()
    @     0x3fffb6fb94fc  caffe::Solver<>::Init()
    @     0x3fffb6fb98e0  caffe::Solver<>::Solver()
    @     0x3fffb6fd3fc4  caffe::Creator_SGDSolver<>()
    @         0x10010e54  caffe::SolverRegistry<>::CreateSolver()
    @     0x3fffb6fb1548  caffe::P2PSync<>::P2PSync()
    @     0x3fffb6fb2dd0  caffe::P2PSync<>::Run()
    @         0x1000a624  train()
    @         0x100078a0  main
    @     0x3fffb5fa4700  generic_start_main.isra.0
    @     0x3fffb5fa48f4  __libc_start_main
    @              (nil)  (unknown)