Following the object detection example on GitHub (https://github.com/NVIDIA/DIGITS/tree/master/examples/object-detection), I was able to load the dataset, create the DetectNet network, and start training. I had set the batch size to 2 and the batch accumulation to 5 so that training would fit on a 4GB K2200 GPU. The training ran for about 15 hours and then stopped with the following error:
ERROR: Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
The error occurred while running the example on a single card in a two-card system. I had previously run this example on two K2200 GPU cards (same parameters) and it ran successfully. Is this error indicative of an out-of-memory condition?
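To be clear about the settings above: my understanding is that batch accumulation (iter_size in Caffe) sums gradients over several small batches before each weight update, so only the small batch has to fit in GPU memory at once. A minimal sketch of that arithmetic, where the function name is my own and not part of DIGITS or Caffe:

```python
def effective_batch_size(batch_size: int, batch_accumulation: int) -> int:
    """Number of images contributing to one weight update.

    Hypothetical helper for illustration only; it is not a DIGITS or Caffe API.
    """
    if batch_size < 1 or batch_accumulation < 1:
        raise ValueError("batch_size and batch_accumulation must be >= 1")
    # Gradients from `batch_accumulation` consecutive batches are summed
    # before the solver applies a single update.
    return batch_size * batch_accumulation


# Settings from this run: batch size 2, batch accumulation 5.
print(effective_batch_size(2, 5))  # prints 10
```

So each update sees 10 images, while only 2 images are processed per forward/backward pass.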
The above network had trained for 15 epochs and I want to train it for 30. Is it possible to continue the training from where it left off? I am assuming that the only way to do this is to save the last epoch as a pre-trained model, clone the job, and base the new run on the pre-trained model. Is that right?
Thanks,
Ken