Restore training: Does caffe save the lmdb cursor to solverstate and restore this cursor?

Quan Nguyen Manh

Oct 30, 2017, 4:13:51 PM

to Caffe Users

Hi all,

I am training a CNN model on a large lmdb file. Occasionally the training hangs, so I frequently have to restore the solver state. My question is: does Caffe save the last cursor position of the database, so that when training is restored it reads from where it stopped, or does it read from the beginning of the lmdb file again?

The reason I am asking is that I have two input layers: one reads from the lmdb file and the other reads from memory (a MemoryDataLayer). The data of these two input layers need to be aligned. Very often after restoring the solver state the loss does not seem to converge, so I suspect that Caffe reads the lmdb file from the beginning, which leaves the data in the two input layers mismatched.

Thank you very much.

Przemek D

Nov 6, 2017, 7:03:56 AM

to Caffe Users

I actually spent a while researching this and trying to come up with an answer but I must admit defeat.
When a snapshot is saved, caffe::Net::ToProto() is called, which in turn calls caffe::Layer::ToProto() on every layer of the network. The thing is, I know the abstract caffe::Layer declares a virtual method ToProto(), but I have no idea how it is defined. I suppose this method has to be generated automatically by the Google Protobuf compiler, because most layers contain their LayerParameter (each defined in caffe.proto), which is processed by the protobuf tool first. However, I had no luck finding where it is defined: neither caffe.pb.h nor caffe.pb.cc contains any trace of it. Strangely, the compiled .o files for each layer, upon hex inspection, revealed some symbols containing the "ToProto" string.
I'd be happy to see a more educated answer on this.

Hieu Do Trung

Nov 7, 2017, 3:29:36 AM

to Caffe Users

TL;DR: Caffe reads the lmdb file from the beginning.

Long answer:

I used the following code snippet to write out the first channel of the input data blob as an image (tested on GoogleNet).

Step 1. Train from scratch for about 30 iterations (set the batch size to 1 so that each iteration outputs only one image), then stop.

Step 2. Move the output images to another folder, say TrainImages.

Step 3. Resume training from the solverstate file created at Step 1, running for about 30 iterations again, then stop.

Step 4. Move the output images to another folder, say ResumeImages.

Now compare the output images in TrainImages and ResumeImages.

They are the same.

This means that the input images at iteration 31 (resumed from the solverstate) come from the beginning of the lmdb file.
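The comparison of the two folders can also be done programmatically rather than by eye. The following is a minimal sketch, not part of Caffe: FilesIdentical and CountIdentical are hypothetical helper names, and it assumes the images are named 0.jpg, 1.jpg, ... in both folders (the naming the debug code produces, since its counter starts at 0 on every run) and that the same input yields byte-identical JPEG files on the same machine.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: true if both files open and contain the same bytes.
bool FilesIdentical(const std::string& path_a, const std::string& path_b) {
  std::ifstream a(path_a.c_str(), std::ios::binary);
  std::ifstream b(path_b.c_str(), std::ios::binary);
  if (!a || !b) return false;  // a missing file counts as "not identical"
  std::istreambuf_iterator<char> a_it(a), b_it(b), end;
  // Walk both streams in lockstep and stop at the first differing byte.
  while (a_it != end && b_it != end) {
    if (*a_it != *b_it) return false;
    ++a_it;
    ++b_it;
  }
  // Identical only if both streams ended at the same time.
  return a_it == end && b_it == end;
}

// Hypothetical helper: counts how many of 0.jpg .. (n-1).jpg are
// byte-identical in the two folders.
int CountIdentical(const std::string& dir_a, const std::string& dir_b, int n) {
  assert(n >= 0);
  int same = 0;
  for (int i = 0; i < n; ++i) {
    const std::string name = "/" + std::to_string(i) + ".jpg";
    if (FilesIdentical(dir_a + name, dir_b + name)) ++same;
  }
  return same;
}
```

If CountIdentical("TrainImages", "ResumeImages", 30) returns 30, the resumed run saw the same first 30 records as the original run.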

// debug code start
#include <cstdio>
#include <string>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/highgui/highgui_c.h>
#include <opencv2/imgproc/imgproc.hpp>

int iteration = 0;  // number of images written so far; restarts at 0 on every run
// debug code end

template <typename Dtype>
Dtype Net<Dtype>::ForwardFromTo(int start, int end) {
  CHECK_GE(start, 0);
  CHECK_LT(end, layers_.size());
  Dtype loss = 0;
  for (int i = start; i <= end; ++i) {
    // LOG(ERROR) << "Forwarding " << layer_names_[i];
    Dtype layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
    loss += layer_loss;
    if (debug_info_) { ForwardDebugInfo(i); }
    ///////////////////////////////////////////////////////////////////////
    // debug code start
    for (int top_id = 0; top_id < top_vecs_[i].size(); ++top_id) {
      const Blob<Dtype>& blob = *top_vecs_[i][top_id];
      const string& blob_name = blob_names_[top_id_vecs_[i][top_id]];
      int phase = this->phase();
      if (phase > 0) continue;  // skip TEST phase (TRAIN == 0)
      if (blob_name.compare("data") == 0) {
        int numAxes = blob.num_axes();
        if (numAxes > 3) {  // expect an N x C x H x W blob
          int h = blob.shape(2);
          int w = blob.shape(3);
          // Wrap the first channel of the first image without copying.
          // cv::Mat needs a non-const pointer; the data is only read here.
          const Dtype* data = blob.cpu_data();
          const int cv_type = sizeof(Dtype) == sizeof(float) ? CV_32FC1 : CV_64FC1;
          cv::Mat img(h, w, cv_type, const_cast<Dtype*>(data));
          cv::Mat cv_img;
          img.convertTo(cv_img, CV_8UC1);  // values outside [0, 255] are saturated
          char buffer[32];
          snprintf(buffer, sizeof(buffer), "%d", iteration);
          std::string sId = buffer;
          std::string path = "/home/user/Desktop/tmp/" + sId + ".jpg";
          cv::imwrite(path, cv_img);
          iteration++;
        }
      }
    }
    // debug code end
    ///////////////////////////////////////////////////////////////////////
  }
  return loss;
}
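Given that result, the two inputs in the original question have to be realigned by hand after a resume. The sketch below is a toy model, not Caffe code: ToyCursor, RecordOffsetAfter and SkipTo are hypothetical names, and it assumes the lmdb is read sequentially without shuffling, wraps around to the first record at the end, and that any prefetching by the data layer can be ignored. It computes the record index a run would have reached after a given number of iterations, which is where the in-memory source would need to be rewound from, or how far a fresh cursor would need to be advanced.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a sequential database cursor (NOT Caffe's db::Cursor).
struct ToyCursor {
  std::size_t pos;          // index of the next record to be read
  std::size_t num_records;  // total number of records in the database

  // Returns the current record index, then advances, wrapping at the end.
  std::size_t Next() {
    const std::size_t current = pos;
    pos = (pos + 1) % num_records;
    return current;
  }
};

// Index of the next unread record after `iterations` iterations of
// `batch_size` records each, on a database that wraps around.
std::size_t RecordOffsetAfter(std::size_t iterations, std::size_t batch_size,
                              std::size_t num_records) {
  assert(num_records > 0);
  return (iterations * batch_size) % num_records;
}

// Advances a cursor by `offset` records, discarding what it reads.
void SkipTo(ToyCursor* cursor, std::size_t offset) {
  for (std::size_t i = 0; i < offset; ++i) cursor->Next();
}
```

In this model, a cursor created fresh after 30 iterations with batch size 1 returns record 0, whereas an uninterrupted run would be at record 30. Either the in-memory data is restarted from its first sample as well, or one side is moved by the computed offset; which of the two is practical depends on how the MemoryDataLayer is being fed.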
