Training on KITTI data set for object detection

Ken M Erney

Sep 15, 2016, 8:01:24 AM
to DIGITS Users
Following the object detection example on GitHub (https://github.com/NVIDIA/DIGITS/tree/master/examples/object-detection), I was able to load the dataset, create the DetectNet network, and start training.  I had set the batch size to 2 and the batch accumulation to 5 to support running on a 4GB K2200 GPU.  The training ran for about 15 hours and then stopped with the following error:

ERROR: Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

The error occurred while running the example on a single card in a two-card system.  I had previously run this example on two K2200 GPUs (same parameters) and it ran successfully.  Is this error indicative of an out-of-memory error?
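(For reference on the memory side: as I understand it, DIGITS passes "batch accumulation" to Caffe as the solver's iter_size, so batch size 2 with accumulation 5 gives an effective batch of 2 x 5 = 10 images per weight update while only ever holding 2 images' worth of activations in GPU memory at a time.)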

The above network had trained for 15 epochs and I want to train it for 30.  Is it possible to continue the training from where it left off?  I am assuming that the only way to do this is to save the last epoch as a pre-trained model, clone the job, and base the new run on the pre-trained model?

Thanks,
Ken

Luke Yeager

Sep 15, 2016, 6:08:09 PM
to DIGITS Users
I believe that's a bug that was fixed relatively recently on the caffe-0.15 branch. Can you grab the latest code on the branch to test?
https://github.com/NVIDIA/caffe/commits/caffe-0.15
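If you built from source, updating is roughly the following (a sketch assuming a git checkout of NVIDIA/caffe built with the standard Makefile; adjust the path to wherever your clone lives):

    cd ~/caffe                        # example path, not necessarily yours
    git fetch origin
    git checkout caffe-0.15
    git pull origin caffe-0.15
    make all -j"$(nproc)" && make pycaffe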

If you installed from deb packages, you can keep using the deb package for DIGITS if you like.

Ken M Erney

Sep 15, 2016, 10:50:51 PM
to DIGITS Users
Thanks, Luke. I will pull the latest version of NVIDIA Caffe and rebuild.  I am using the source version of the DIGITS stack, so it should be pretty straightforward to update Caffe.

One question... once Caffe is updated, is it possible to continue the training from epoch 15, or should I start the training over again?

Greg Heinrich

Sep 16, 2016, 9:46:09 AM
to DIGITS Users
Hello Ken,
You can start training a new model off the weights of the last snapshot from your failed job by selecting that model/snapshot in the "pretrained models" tab of the model creation form. That should help it converge faster, though it isn't technically identical to resuming the failed job, since the solver settings will be reset. You can tweak the learning rate to make it equal to the last learning rate of the failed job, but some other solver state, such as momentum, can't be restored.
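To illustrate with made-up numbers: under Caffe's "step" policy the rate at iteration i is base_lr * gamma^floor(i / stepsize), so a job that started at base_lr = 1e-4 with gamma = 0.1 and has passed one step boundary is running at 1e-4 * 0.1 = 1e-5, and that is the value you would enter as the base learning rate of the new job. (Those numbers are just an example, not your job's actual settings.)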

Regards.

Ken M Erney

Sep 16, 2016, 10:18:00 AM
to DIGITS Users
Hi Luke, I just checked my build of Caffe and I am using the latest version of the 0.15 branch (v0.15.13).  There is a 0.16 branch, but I am a little hesitant to switch to that ... unless you think it is unlikely to break the current DIGITS install.  What do you think?  I am also going to try restarting the job as Greg suggests and see if it will continue to train.
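For anyone wanting to check their own source build, one way (assuming your clone has its tags fetched; the path is just an example) is:

    git -C ~/caffe describe --tags    # prints the nearest tag, e.g. v0.15.13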

Thanks,
Ken

Ken M Erney

Sep 16, 2016, 10:25:22 AM
to DIGITS Users
Hi Greg, I went ahead and saved the failed job as a pre-trained model, cloned the failed job, and then modified it to use the pre-trained model; it's now running for another 20 epochs.  I will update the post if it fails again.  The same job does work on the .deb install of DIGITS... except, when I was running it on the .deb install, it was training on both GPUs and there were no other jobs running at the same time.

Luke Yeager

Sep 19, 2016, 1:49:58 PM
to DIGITS Users
Our workflow is for people to use the default branch at all times. Currently, the default branch is caffe-0.15.
https://github.com/NVIDIA/caffe/branches