Multi-class, multi-label classification, multi-target regression and multi-task learning with Caffe

Emmanuel Benazera

Nov 6, 2015, 9:41:27 AM

to Caffe Users, be...@droidnik.fr

Hi,

I maintain my own wrapper around Caffe (https://github.com/beniz/deepdetect), and several applications under development that use it are best modeled as multi-label / multi-task learning. From reading around the forum, it is clear that not everyone knows how to set this up. There also seems to be some confusion about how to model the different types of classification and regression. My aim is to automate this modeling once and for all for my applications, so I thought I'd share my current understanding and solutions for each of the classification and regression tasks in the title. Please correct me where I am wrong or where a solution is not appropriate: by collaborating on this we should be able to anchor a definitive set of solutions here and / or on the wiki.

Vocabulary: I use 'class' as an instance of a 'label' for classification tasks, and 'target' as the objective of a regression task. Maybe this is unorthodox, but I hope it makes what follows clearer.

- Multi-class classification (MC)
Usage: predicts the top (or top-n) classes for a single label (e.g. the top image class among 1000 possible classes, with the image tag as the single label). The classes are mutually exclusive.
Caffe: use a softmax layer of the form:

layer {
   name: "loss"
   type: "SoftmaxWithLoss"
   bottom: "ip"
   bottom: "label"
   top: "loss"
   loss_weight: 1
}

'ip' is usually an inner product layer, and label comes as a vector of classes, one class per training/testing example in the batch.
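
To make the MC setup concrete, here is a minimal pure-Python sketch of what a softmax loss computes for a batch. It is only an illustration under simple assumptions (plain lists, loss averaged over the batch), not Caffe's implementation; the function names are made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(batch_scores, labels):
    """Mean negative log-likelihood of the true class over the batch."""
    losses = []
    for scores, label in zip(batch_scores, labels):
        probs = softmax(scores)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Two examples, three classes; one integer class index per example,
# mirroring the 'ip' and 'label' bottoms of the layer above.
ip = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
label = [0, 1]
loss = softmax_loss(ip, label)
print(loss)
```

Since the probabilities sum to 1 per example, raising one class necessarily lowers the others, which is why this loss only fits mutually exclusive classes.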

- Multi-label classification (MLC)
(this is misnamed IMO because it is really about a single label whose classes are not mutually exclusive, but this is how the literature appears to refer to it)
Usage: predicts n among m classes for a label (e.g. tag an image with 'bear' and 'forest' that are among 1000 classes). In general there are two ways to achieve this: either build an equivalent MC task over 2^m classes (one class per possible subset of the m classes), or build m binary classifiers as MC tasks.
Caffe: same as MC problem in both cases, though the input must be pre-processed accordingly.
Notes: - There is work on alternatives to the two transformation cases above but I won't detail them here, a well-known one is http://arxiv.org/abs/1312.5419
             - An alternative is to use multi-task learning, see below.
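
The two transformations can be sketched as label pre-processing in plain Python. This is only an illustration; the helper names are hypothetical and the bit ordering of the powerset index is an arbitrary choice.

```python
def to_powerset_class(active):
    """Map a 0/1 vector over m classes to one class index in [0, 2**m)."""
    index = 0
    for bit in active:
        index = index * 2 + bit
    return index

def to_binary_tasks(active):
    """Map the same vector to m binary labels, one per binary classifier."""
    return [int(bit) for bit in active]

tags = [0, 1, 1]  # m = 3 classes, the last two being active
num_powerset_classes = 2 ** len(tags)      # size of the equivalent MC task
powerset_label = to_powerset_class(tags)   # single MC label
binary_labels = to_binary_tasks(tags)      # one label per binary MC task
print(num_powerset_classes, powerset_label, binary_labels)
```

The powerset variant grows exponentially with m, while the binary variant grows linearly but needs m classifiers.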

- Multi-target regression (MTR)
Usage: fit n regression targets (e.g. 3 coordinates of an object in continuous space)
Caffe: there are several ways to achieve this. I have implemented, tested and automated the one that embeds the multiple targets within the data and slices them away at runtime to pass them to the objective function. I find this solution much simpler than the ones requiring multiple databases, but I may have missed something here.
1- embed each training / testing sample's targets as a vector of floats at the end of a Datum's channels (i.e. after the features)
   In C++, something like
     std::vector<float> targets = {0.8, 2.5, 3.0};
     Datum dt;
     // fill up with data of interest
     for (float s: targets) {
       dt.add_float_data(s);
     }

2- set your data as input to the net. Here I use MemoryData, which is the input layer I work with the most, but this should work similarly for DataLayer etc...
    layer {
      name: "input"
      type: "MemoryData"
      top: "fulldata"
      top: "fake_label"
      include {
        phase: TRAIN
      }
      memory_data_param {
        batch_size: 32
        channels: 26
        height: 1
        width: 1
      }
    }

   Check the trick above: the MemoryData layer requires two tops, one for data, one for a single label, but since our multiple labels are embedded into the 'fulldata' now, we fake the label top here. If someone tries this out with DataLayer, please share.

3- set the slice layer to recover the labels from 'fulldata':
layer {
    name: "slice_labels"
    type: "Slice"
    bottom: "fulldata"
    top: "data"
    top: "label"
    slice_param {
      slice_dim: 1
      slice_point: 23
    }
}

The example uses 23 channels and 3 targets, thus 'fulldata' has channel size 26.
This trick is working for me and is now automated since I don't want to think about it anymore...
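
For readers who want to check the bookkeeping, the embed-then-slice idea can be mimicked in a few lines of plain Python. This is a sketch with hypothetical helper names; it only shows the indexing for one sample, not what the Slice layer does internally.

```python
def embed(features, targets):
    """Append the regression targets after the features of one sample."""
    return list(features) + list(targets)

def slice_at(fulldata, slice_point):
    """Split one sample back into (data, label) at the given channel index."""
    return fulldata[:slice_point], fulldata[slice_point:]

features = [0.1 * i for i in range(23)]  # 23 feature channels
targets = [0.8, 2.5, 3.0]                # 3 regression targets
fulldata = embed(features, targets)      # 26 channels in total
data, label = slice_at(fulldata, 23)     # same slice_point as in the prototxt
print(len(fulldata), len(data), label)
```

The slice point is simply the number of feature channels, so it has to be kept in sync with the 'channels' value of the input layer.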

- Multi-task learning
Usage: transfer learning across tasks in order to predict multiple classes across multiple labels (e.g. predict both gender and age). This is akin to fine-tuning a net, which is equivalent to solving each MC problem one after the other, but there's also a way to learn all labels at once, and this is the one I am using below.
Note: I haven't tested nor automated this one just yet, but will very soon. This post allows me to double check that the solution is indeed correct.
Caffe: the idea is to embed the labels into the data, slice them away as label1, label2, ... and pass them on to their own separate inner product and softmax layers within the same net. This means that the multiple tasks share the early net layers, then separate / specialize while benefiting from the common set of high-level features.
To do so, first use the trick above to embed the labels instead of the targets. Then we need to slice them, but this time we separate the labels, e.g.

layer {
    name: "slice_labels"
    type: "Slice"
    bottom: "fulldata"
    top: "data"
    top: "label1"
    top: "label2"
    top: "label3"
    slice_param {
      slice_dim: 1
      slice_point: 23
      slice_point: 24
      slice_point: 25
    }
}

The softmax layers are then defined as usual, one over each of 'label1', 'label2', etc... In order to combine the losses into a single generic loss indicator for the net, you can add an importance weight to each softmax, with 'loss_weight: 0.5' for instance. This step is optional.

Note: the same architecture works for regression with targets, just replace the softmax with a euclidean loss layer for instance.
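
As a sanity check of the multi-point slicing and of the weighted loss combination, here is a small plain-Python sketch. The helper names and the loss values are made up for illustration; it mirrors the indexing only, not Caffe's code.

```python
def slice_multi(fulldata, slice_points):
    """Cut one sample at each slice point; n points give n + 1 pieces."""
    pieces, start = [], 0
    for point in slice_points:
        pieces.append(fulldata[start:point])
        start = point
    pieces.append(fulldata[start:])
    return pieces

def total_loss(losses, weights):
    """Weighted sum of per-task losses, in the spirit of loss_weight."""
    return sum(w * l for w, l in zip(weights, losses))

fulldata = [0.0] * 23 + [2, 0, 5]  # 23 features followed by 3 class labels
data, label1, label2, label3 = slice_multi(fulldata, [23, 24, 25])

# Hypothetical per-task loss values, only to show the combination.
combined = total_loss([1.2, 0.4, 0.9], [1.0, 0.5, 0.5])
print(label1, label2, label3, combined)
```

Four tops therefore need three slice points, each one marking where the next top starts.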

- Multi-label multi-task learning
This one is left as an exercise, I am stopping here, it is a sunny day :)

Hope the above helps,

Em.

Vimal Thilak

Nov 6, 2015, 2:44:17 PM

to Caffe Users, be...@droidnik.fr

Em,

Nice post. One detail to note is that you need to shuffle the data manually if you are reading from two different "data" sources, since Caffe has no notion of synchronizing data across sources. I suppose you could also use multiple DataLayers (i.e., multiple LMDBs) to store the labels for your different tasks and read them as inputs.

Emmanuel Benazera

Nov 6, 2015, 4:08:33 PM

to Caffe Users, be...@droidnik.fr

Hi Vimal,

One of the advantages of using slice layers over multiple data sources (e.g. dbs) is, as far as I understand, that there's no need to shuffle the data manually anymore. Again, my understanding is that this should work fine with LMDBs and HDF5.

Em.

Vimal Thilak

Nov 6, 2015, 4:34:27 PM

to Caffe Users, be...@droidnik.fr

Hi Em,

Thanks! This is good to know because it will make my life easier. I will double check this point when I get a chance. Currently, I use an LMDB for the labels (a vector, because I'm using cross entropy loss to train my network) and ImageDataLayer for my images. I manually shuffle the images and disable shuffling by the Caffe framework.

-Vimal

Shravan t r

Nov 9, 2015, 11:15:00 AM

to Caffe Users, be...@droidnik.fr

Hi Emm,

I have read the post. It's very informative and well written. Thanks a lot.

My problem belongs to MLC according to your naming. The two methods you suggest for this problem become cumbersome when the number of prediction classes is large. For example, I have 23 classes to classify as a start. This means that I need to either build an equivalent MC task over 2^23 classes (first method) or have 23 binary classifiers (second method).

My input is of size 500x500 and the network already has 3 convolution layers with 2 inner product layers. Either method adds complexity and makes the network very slow.

Another important problem is context: the network can no longer relate classes within the same image, because we are making the classes mutually exclusive. I am hence rather insistent on using a multi-label vector as output, to retain context and simplicity.

I will post the same on your page; we can continue the discussion there if that is helpful and easier.

Thanking you,
Shravan

Emmanuel Benazera

Nov 9, 2015, 11:41:43 AM

to Caffe Users, be...@droidnik.fr

Hey Shravan,

Thanks for joining the thread, I just hope it will make things easier for potential future readers ;)

Two thoughts:
- you may be interested in this PR: https://github.com/BVLC/caffe/pull/3268
I haven't tried it myself. I'm interested in a sample prototxt if you do check it out.

- in your original post (https://groups.google.com/forum/#!topic/caffe-users/sozN9_ypJfw) you mentioned:

"I am trying for a multi label classification problem. I have 23 classes as output. I am encoding my output labels in binary vector format like [ 0 0 0 0 1 1 1 0 ...]. 0 meaning inactive and 1 being an active class. I tried to solve this with sigmoid cross entropy loss as softmax currently does not support multi label outputs. I have very poor convergence with sigmoid loss."

I'd try multi-task learning, the trick above, or even a single MC problem, and see whether convergence has a better shape. This is because the poor convergence may come from other elements in the net parameters, such as the learning rate, a mistake in the prototxt etc...

The type of nets you are describing is arising more often in my own applications too.

Em.

Youssef Kashef

Nov 10, 2015, 3:55:59 AM

to Caffe Users, be...@droidnik.fr

Hello Emmanuel,

thanks for the post. I'm interested in multi-task learning and found the overview very helpful.

I've been training a conv. net on multiple tasks by adding an inner-product layer + loss layer for each task. I don't think that Caffe supports constructing the label vector manually and adding 1's to allow for two classes to coexist. But I'm not sure if this is going to be any better than the former setup with ip+loss layers per task.

I'm concerned about how the gradients are computed when a weight is influenced by two loss functions. The weights of the ip for each class are only influenced by the loss for this class. Once you go further backwards and arrive at a layer that is shared between both "branches", the update of the layer will be the summation of partial derivatives. From tracing the backward pass, it seems that the gradient for a shared weight is updated for 'loss1'; then, before propagating further backwards, it'll compute loss2 and propagate that backwards to that shared layer, basically adding another term to the updated weight.

Does this sound right to you? Others are welcome to chime in of course.

Thanks again for the post and sharing your wrapper code,

Youssef

Emmanuel Benazera

Nov 10, 2015, 4:27:07 AM

to Caffe Users, be...@droidnik.fr

On Tuesday, November 10, 2015 at 9:55:59 AM UTC+1, Youssef Kashef wrote:

I've been training a conv. net on multiple tasks by adding an inner-product layer + loss layer for each task. I don't think that Caffe supports constructing the label vector manually and adding 1's to allow for two classes to coexist. But I'm not sure if this is going to be any better than the former setup with ip+loss layers per task.

Hi Youssef,

I'm not certain I understand what you mean by adding 1's manually here.

I'm concerned about how the gradients are computed when a weight is influenced by two loss functions. The weights of the ip for each class are only influenced by the loss for this class. Once you go further backwards and arrive at a layer that is shared between both "branches", the update of the layer will be the summation of partial derivatives. From tracing the backward pass, it seems that the gradient for a shared weight is updated for 'loss1'; then, before propagating further backwards, it'll compute loss2 and propagate that backwards to that shared layer, basically adding another term to the updated weight.

This sounds a bit like the GoogleNet implementation in the Caffe repository: three losses are composed backward in the gradients (though this is all for a single task). Each loss can be attributed a weight, so the overall effect of each task can indeed be (manually) accommodated. So it does sound right to me, but I haven't tried it yet outside of some modifications to the original GoogleNet.
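
The summation of gradients at a shared weight can be checked numerically on a toy example. The sketch below is an assumption-laden illustration (one shared scalar weight, two linear branches with squared losses, made-up constants), not a trace of Caffe's backward pass.

```python
def losses(w, x=1.5, a=2.0, b=-1.0, t1=1.0, t2=0.5):
    """Two branch losses that both depend on the shared weight w."""
    h = w * x                    # output of the shared layer
    loss1 = (a * h - t1) ** 2    # branch 1
    loss2 = (b * h - t2) ** 2    # branch 2
    return loss1, loss2

def shared_grad(w, weight1, weight2, x=1.5, a=2.0, b=-1.0, t1=1.0, t2=0.5):
    """Analytic gradient w.r.t. w: weighted sum of the per-branch gradients."""
    h = w * x
    grad1 = 2 * (a * h - t1) * a * x
    grad2 = 2 * (b * h - t2) * b * x
    return weight1 * grad1 + weight2 * grad2

def numeric_grad(w, weight1, weight2, eps=1e-6):
    """Central finite difference of the weighted total loss."""
    def total(v):
        l1, l2 = losses(v)
        return weight1 * l1 + weight2 * l2
    return (total(w + eps) - total(w - eps)) / (2 * eps)

analytic = shared_grad(0.3, 1.0, 0.5)
numeric = numeric_grad(0.3, 1.0, 0.5)
print(analytic, numeric)
```

Both values agree, which matches the intuition that each loss contributes one additive term to the gradient of a shared weight, scaled by its loss weight.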

Em.

Youssef Kashef

Nov 10, 2015, 4:41:43 AM

to Caffe Users, be...@droidnik.fr

Hello Em,

Re: adding 1's manually:

I was referring to Shravan's comment "...I am encoding my output labels in binary vector format like [ 0 0 0 0 1 1 1 0 ...]. 0 meaning inactive and 1 being an active class."

Concatenating the label vector of each task into a single label vector. Currently an MNIST class of 4 is converted into a vector of zeros where the element 4 (subtract 1 for zero-based indexing) is equal to 1. Class labels are converted into binary vector with a single 1 to indicate the class. One way to support multiple tasks could have been skipping this vectorization of the class label and constructing it manually. You would select which elements are active. The norm would reflect the error in all tasks.
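
A minimal sketch of that manual construction, in plain Python with hypothetical helper names and zero-based class indices: each task's class index is expanded into a one-hot vector and the vectors are concatenated.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at the given zero-based index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def concat_task_labels(class_indices, task_sizes):
    """Concatenate one one-hot vector per task into a single label vector."""
    combined = []
    for index, size in zip(class_indices, task_sizes):
        combined.extend(one_hot(index, size))
    return combined

# Task 1: 10 classes, active class 4. Task 2: 2 classes, active class 1.
label = concat_task_labels([4, 1], [10, 2])
print(label)
```

The resulting vector has one active element per task, so a loss computed over the whole vector reflects the error in all tasks at once.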

Youssef

mintaka

Nov 10, 2015, 1:27:27 PM

to Caffe Users, be...@droidnik.fr

Hi Vimal,

I'm not sure I understand your point correctly. But isn't shuffling set to false by default in caffe.proto (https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L1216)? I think that for two sources (one for labels, the other for images), as long as you align them well during data preparation and use the same batch size, they should be good to use by default.

Vimal Thilak

Nov 10, 2015, 4:56:55 PM

to Caffe Users, be...@droidnik.fr

Hi mintaka,

You are correct. My concern is that, as a framework user, that's an additional detail that I need to take care of before I use Caffe to train a model. What I do is create a text file with the input file names, shuffle it, and use the shuffled output to create my labels. This works well, but a bigger concern is that the order of the inputs is fixed at the time training starts (as opposed to inputs getting shuffled after training has run for an epoch), which creates another bias in my training.
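
The manual shuffling step can be sketched in plain Python as applying one shared permutation to both lists, so that images and labels stay aligned. This is only an illustration of the data preparation done outside Caffe; the file names and labels are made up.

```python
import random

def shuffle_together(images, labels, seed):
    """Shuffle two aligned lists with the same permutation."""
    order = list(range(len(images)))
    random.Random(seed).shuffle(order)
    return [images[i] for i in order], [labels[i] for i in order]

images = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
labels = [[0, 1], [1, 0], [1, 1], [0, 0]]

# A different seed per epoch gives a different order each time, which
# is what a listing fixed once before training cannot provide.
for epoch in range(2):
    epoch_images, epoch_labels = shuffle_together(images, labels, seed=epoch)
    print(epoch, epoch_images, epoch_labels)
```

Writing one database pair per shuffled order is costly, so this mostly shows why a single embedded source is convenient.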


Yotam Hechtlinger

Nov 10, 2015, 9:45:50 PM

to Caffe Users, be...@droidnik.fr

It is also possible to do multi-label classification in Caffe by training a single network and using the SigmoidCrossEntropyLoss layer to calculate the loss.

The output will be a vector with a separate probability for each label. This makes the computation much faster than 2^n classifiers.

The data should enter Caffe from 2 different databases: one with the images and the other with the labels. This time a label is a vector of 0's and 1's, and should have dimensions (n,k,1,1), where n is the batch size and k is the number of labels.
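
For intuition, a sigmoid cross-entropy loss over a 0/1 label vector can be sketched in plain Python as below. This is a minimal illustration for one sample, using a numerically stable formulation; it makes no claim about how Caffe normalizes the loss across the batch, and the scores are made up.

```python
import math

def sigmoid(x):
    """Squash a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(scores, targets):
    """Sum of independent binary cross-entropies, one per label."""
    loss = 0.0
    for x, t in zip(scores, targets):
        # Stable form of -(t*log(p) + (1-t)*log(1-p)) with p = sigmoid(x).
        loss += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return loss

scores = [2.0, -1.0, 0.5]  # raw outputs of the last inner product layer
targets = [1, 0, 1]        # k = 3 labels, two of them active
loss = sigmoid_cross_entropy(scores, targets)
print(loss)
```

Each label gets its own independent probability, so several labels can be active at once without competing for a shared probability mass.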

Lixin Duan

Nov 11, 2015, 1:17:36 AM

to Yotam Hechtlinger, Caffe Users, be...@droidnik.fr

Besides training models for multi-label classification, I think what is also interesting is how to make predictions of a test sample. For multi-class classification (e.g., the standard GoogleNet, CaffeNet models), it makes sense to just assign the class with the highest probability. But this strategy would fail miserably in the context of multi-label classification.


taras sereda

Nov 11, 2015, 4:22:13 PM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

Hi Mintaka. Why do you think that taking, say, the top-10 most probable classes would fail for multi-label classification, if what we have as output are values that we can treat as probabilities of each label being assigned to input x?

Can you explain your concerns?

taras sereda

Nov 11, 2015, 4:40:09 PM

to Caffe Users, be...@droidnik.fr

Hi Yotam, have you tried this approach?

Or maybe you have sample prototxt files for this type of setting.

And how do you combine 2 databases in Caffe?

I've implemented a similar approach in Torch.

But instead of SigmoidCrossEntropy I've used softmax cross-entropy, and as target I've used the normalised binary encoded vector [0,1,1,0,1] / count(1)

Unfortunately the results were not so impressive. Maybe this was due to the vector being too saturated with 1's, resulting in low values for each probability.
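
A small plain-Python sketch of this normalised-target setup shows why each probability ends up low: with softmax, the best achievable output for count(1) active classes is 1/count(1) each. The function names and scores below are made up for illustration; this is not the Torch code.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_cross_entropy(scores, multi_hot):
    """Cross-entropy against the multi-hot vector divided by its count of 1's."""
    count = sum(multi_hot)
    target = [v / count for v in multi_hot]
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

multi_hot = [0, 1, 1, 0, 1]
# Scores that put nearly all the mass, equally, on the three active classes.
scores = [-20.0, 5.0, 5.0, -20.0, 5.0]
probs = softmax(scores)
loss = soft_target_cross_entropy(scores, multi_hot)
print(probs, loss)
```

Even in this ideal case each active class only reaches about 1/3 and the loss stays near log(3), so a fixed threshold becomes harder to pick as the number of active labels varies.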

taras sereda

Nov 11, 2015, 5:22:35 PM

to Caffe Users, be...@droidnik.fr

Emm. Thanks for this great review of non-classic NN-based tasks!

Can you provide an example of multi-label multi-task?

You mean that in this setting labels have huge variation, right? Say, with 10 tasks and several possible values for each, it's possible to have examples with fewer than 10 tasks.

mintaka

Nov 11, 2015, 5:39:00 PM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

Hi Taras, sorry for being unclear. Consider a test image which has two labels (e.g., car and person). The output of GoogleNet will be a vector of prediction scores (normalized to sum to 1). It is quite possible that the prediction scores of those two classes are 0.44 and 0.52, with the remaining 0.04 spread over the other classes. Let's say we have set a threshold for each class, which is used to determine the predicted class(es) of a given test image, and that the thresholds for car and person are some high value, e.g., 0.9 (this high value is to make sure we have high precision). In this case, our test image will be classified as neither car nor person.
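
The numbers in this example can be replayed with a few lines of plain Python. The helper name is hypothetical and the scores are the ones from the example, with all remaining classes lumped into one entry.

```python
def predict_with_threshold(scores, threshold):
    """Return the indices of the classes whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Softmax scores for: car, person, all remaining classes together.
softmax_scores = [0.44, 0.52, 0.04]

high_precision = predict_with_threshold(softmax_scores, 0.9)
top1 = max(range(len(softmax_scores)), key=lambda i: softmax_scores[i])
print(high_precision, top1)
```

With a 0.9 threshold no class is predicted at all, and top-1 only recovers 'person', even though both labels are present.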

Yotam Hechtlinger

Nov 12, 2015, 12:41:57 AM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

When you switch the layer to SigmoidCrossEntropyLoss, the output of the classifier is a vector in which each coordinate holds its own prediction value for the corresponding class.

The output doesn't sum to 1 anymore. To do the classification you just threshold everything >0.5 as 1 and otherwise 0. You can also treat the threshold as a tuning parameter, and adjust it accordingly. An example of the important change in the prototxt is:

layer {
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104.0
    mean_value: 117.0
    mean_value: 123.0
  }
  data_param {
    source: "DB_LOCATION/train_data_lmdb"
    batch_size: 10
    backend: LMDB
  }
}

layer {
  type: "Data"
  top: "data_label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "DB_LOCATION/train_score_lmdb"
    batch_size: 10
    backend: LMDB
  }
}

and then at the end:

layer {
  type: "SigmoidCrossEntropyLoss"
  bottom: "fc8_changed"
  bottom: "data_label"
  top: "loss"
  loss_weight: 1
}

During the test phase you should just use the Sigmoid layer without the loss.

Again (important): the labels should be saved in the LMDB as (n,k,1,1), where n is the batch size and k is the number of classes. The labels and the data must be in the same order when the DBs are created.
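
The test-time step described above (a Sigmoid on the last layer's output, followed by a threshold) can be sketched in plain Python as follows. The scores are made up and the function names are hypothetical; this only illustrates the thresholding, not a Caffe layer.

```python
import math

def sigmoid(x):
    """Squash a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(scores, threshold=0.5):
    """Mark a label as active when its sigmoid output exceeds the threshold."""
    return [1 if sigmoid(s) > threshold else 0 for s in scores]

fc8 = [3.0, -2.0, 0.4, -0.1]          # hypothetical raw scores for k = 4 labels
default = predict_labels(fc8)         # threshold of 0.5
strict = predict_labels(fc8, 0.9)     # higher threshold, tuned for precision
print(default, strict)
```

Treating the threshold as a tuning parameter trades recall for precision, independently of how many labels are active.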

Emmanuel Benazera

Nov 12, 2015, 3:31:13 AM

to Caffe Users, be...@droidnik.fr

My experience with multi-labels in other (non-NN) settings correlates with that of Mintaka: thresholding is ad hoc and noisy. I haven't yet tried with SigmoidCrossEntropy though. My main interest at this stage would be to better measure whether multi-task learning (MTL) has better accuracy in practice than multi-label, possibly with SigmoidCrossEntropy. If there's a relevant public multi-label dataset (any smaller than StreetView below? :) ), I'd be happy to give it a try. My current hunch is that MTL should provide better discrimination, possibly at the cost of fitting a larger net, due to the set of final FC layers.

Typically, the work on multi-digit number recognition from street view imagery (http://arxiv.org/abs/1312.6082) does seem to put such an MTL setting into practice.

Regarding the multi-label MTL setting, I would rather measure the multi-label / MTL comparison above before digging into it. Maybe it is a good setting whenever the number of final softmax layers would be too high for the fit, and they could then be replaced with a series of SigmoidCrossEntropy layers. One thing that comes to mind in this particular setting is the ability to do MTL over series of overlapping labels, thus providing a way to easily perform ensembling over the results...

Em.

taras sereda

Nov 12, 2015, 6:27:10 AM

to Caffe Users, be...@droidnik.fr

Hi Emm.

There is one known open multi-label dataset, Mirflickr.

http://press.liacs.nl/mirflickr/

I've tried this dataset with the Multimodal RBM described here:

http://www.cs.toronto.edu/~nitish/multimodal/

taras sereda

Nov 12, 2015, 8:48:22 AM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

Mintaka, thanks for answering. Consider the following case.

Taking the example you proposed, it's possible to formulate the loss function as a CrossEntropyLoss where you get a separate probability for each class.

Now the result does not sum to 1, but you have a value in the range [0,1] for each class. Here the same idea of thresholding can be applied.

And at the test stage, as @Yotam said, the last layer after the FC would be a Sigmoid, squashing values into [0,1].

Manuele Tamburrano

Nov 12, 2015, 9:33:43 AM

to Caffe Users, be...@droidnik.fr

Hi,
I've just submitted a PR to adapt SoftmaxWithLossLayers and AddMatVector to MultiClass problems.
The GPU part is not working and I don't know if I have enough spare time to fix that, but the CPU part should work.

If you are interested or if you want to help to fix the last issues, the PR is here: https://github.com/BVLC/caffe/pull/3326

taras sereda

Nov 28, 2015, 12:43:12 PM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

Yotam, thanks for the prototxt example.

I've tried the following way of creating LMDBs, and I'm using the examples you provided in the prototxt for reading from 2 separate LMDBs, with SigmoidCrossEntropy as loss function. I expected to see different rates of loss change when training from scratch and when fine-tuning, but they are nearly the same. Details are below.

Also I have one question. Does it make any difference to put values in the LMDB as N_batches*(Batch_s,k,w,h) rather than (N_batches*Batch_s,k,w,h)?

In the case of labels: N_batches*(Batch_s,voc_length,1,1) or (N_batches*Batch_s,voc_length,1,1)?

I've tried both ways, and I observed that the loss changes consistently when the data is in N_batches*(Batch_s,k,w,h), whereas when all the data points are in the Batch_s dimension the loss oscillates. Which is strange to me.

Maybe I missed something? Thanks in advance.

import fileinput
import math
import re
import sys

import caffe
import lmdb
import numpy as np

# Path to a listing file with one "<image path> <label>" pair per line.
# Assumption: passed on the command line ('data' was left undefined in the post).
data = sys.argv[1]

lmdb_data_name = 'test_data_lmdb'
lmdb_label_name = 'test_score_lmdb'

Inputs = []
Labels = []

for line in fileinput.input(data):
    entries = re.split(' ', line.strip())
    Inputs.append(entries[0])
    Labels.append(entries[1])

b_size = 4
print('Writing labels')
# One write transaction per group of b_size samples.
for idx in range(int(math.ceil(len(Labels)/(1.0*b_size)))):
    in_db_label = lmdb.open(lmdb_label_name, map_size=int(1e12))
    with in_db_label.begin(write=True) as in_txn:
        for label_idx, label_ in enumerate(Labels[(b_size*idx):(b_size*(idx+1))]):
            # Assumption: label_ is a string of 0/1 characters, one per class.
            # Each label is stored as a (k,1,1) datum.
            im_dat = caffe.io.array_to_datum(np.array(list(label_)).astype(float).reshape(len(label_),1,1))
            # Zero-padded keys keep the labels in the same order as the images.
            in_txn.put('{:0>10d}'.format(b_size*idx + label_idx).encode('ascii'), im_dat.SerializeToString())

            string_ = str(b_size*idx+label_idx+1) + ' / ' + str(len(Labels))
            sys.stdout.write("\r%s" % string_)
            sys.stdout.flush()
    in_db_label.close()
print('')

print('Writing image data')
for idx in range(int(math.ceil(len(Inputs)/(1.0*b_size)))):
    in_db_data = lmdb.open(lmdb_data_name, map_size=int(1e12))
    with in_db_data.begin(write=True) as in_txn:
        for in_idx, in_ in enumerate(Inputs[(b_size*idx):(b_size*(idx+1))]):
            im = caffe.io.load_image(in_)
            # HxWxC -> CxHxW, the layout expected in a Datum.
            im_dat = caffe.io.array_to_datum(im.astype(float).transpose((2, 0, 1)))
            in_txn.put('{:0>10d}'.format(b_size*idx + in_idx).encode('ascii'), im_dat.SerializeToString())

            string_ = str(b_size*idx+in_idx+1) + ' / ' + str(len(Inputs))
            sys.stdout.write("\r%s" % string_)
            sys.stdout.flush()
    in_db_data.close()
print('')

But when I start training the loss drops rapidly when I train from scratch:

I1128 17:21:14.422446 25826 solver.cpp:236] Iteration 0, loss = 4557.88
I1128 17:21:14.422489 25826 solver.cpp:252]     Train net output #0: loss = 4557.88 (* 1 = 4557.88 loss) 
I1128 17:21:14.422513 25826 sgd_solver.cpp:106] Iteration 0, lr = 1e-05 
I1128 17:21:16.656311 25826 solver.cpp:236] Iteration 20, loss = 3805.06 
I1128 17:21:16.656376 25826 solver.cpp:252]     Train net output #0: loss = 3805.06 (* 1 = 3805.06 loss) 
I1128 17:21:16.656393 25826 sgd_solver.cpp:106] Iteration 20, lr = 1e-05 
I1128 17:21:18.886127 25826 solver.cpp:236] Iteration 40, loss = 346.539 
I1128 17:21:18.886193 25826 solver.cpp:252]     Train net output #0: loss = 346.539 (* 1 = 346.539 loss) 
I1128 17:21:18.886209 25826 sgd_solver.cpp:106] Iteration 40, lr = 1e-05 
I1128 17:21:21.115128 25826 solver.cpp:236] Iteration 60, loss = 290.139 
I1128 17:21:21.115190 25826 solver.cpp:252]     Train net output #0: loss = 290.139 (* 1 = 290.139 loss)

And I get the same rate of change of the loss when I fine-tune:

I1128 17:12:29.984871 25734 solver.cpp:288] Learning Rate Policy: step
I1128 17:12:30.073063 25734 solver.cpp:236] Iteration 0, loss = 5421.64
I1128 17:12:30.073132 25734 solver.cpp:252]     Train net output #0: loss = 5421.64 (* 1 = 5421.64 loss)
I1128 17:12:30.073166 25734 sgd_solver.cpp:106] Iteration 0, lr = 1e-05
I1128 17:12:32.307703 25734 solver.cpp:236] Iteration 20, loss = 3074.11
I1128 17:12:32.307770 25734 solver.cpp:252]     Train net output #0: loss = 3074.11 (* 1 = 3074.11 loss)
I1128 17:12:32.307796 25734 sgd_solver.cpp:106] Iteration 20, lr = 1e-05
I1128 17:12:34.540082 25734 solver.cpp:236] Iteration 40, loss = 305.52
I1128 17:12:34.540153 25734 solver.cpp:252]     Train net output #0: loss = 305.52 (* 1 = 305.52 loss)
I1128 17:12:34.540177 25734 sgd_solver.cpp:106] Iteration 40, lr = 1e-05
I1128 17:12:36.772572 25734 solver.cpp:236] Iteration 60, loss = 293.924

包青平

Nov 30, 2015, 3:45:57 AM

to Caffe Users

I have tried this script to generate an LMDB dataset, and I ran into the same problem. You could try generating the label LMDB with this script, and the image LMDB with convert_imageset in Caffe. This way, I could do a multi-label regression. My train_val.prototxt is from sukritshankar's answer in https://github.com/BVLC/caffe/issues/2407 .

On Sunday, November 29, 2015 at 1:43:12 AM UTC+8, taras sereda wrote:

...

Oscar Beijbom

Jan 9, 2016, 3:50:09 PM

to Caffe Users

I have made a pull-request that demonstrates how to do multi-label classification using a python data-layer and the sigmoid cross entropy loss.

https://github.com/BVLC/caffe/pull/3471

Jeremy Rutman

Jun 18, 2016, 9:15:26 AM

to Caffe Users, hecht...@gmail.com, be...@droidnik.fr

Is this still valid?

I have a multi-label problem with 21 possible labels that I encode with the 'many-hot' method, e.g. [0 1 0 1 1 0 0...]
If I don't pad the label vector then its shape matches the output shape, but I hit a dimension problem:

I0618 07:43:18.221647 21547 net.cpp:141] Setting up data
I0618 07:43:18.221740 21547 net.cpp:148] Top shape: 1 3 227 227 (154587)
I0618 07:43:18.221766 21547 net.cpp:148] Top shape: 1 21 (21)
...
I0618 07:43:19.005381 21547 net.cpp:141] Setting up myfc8
I0618 07:43:19.005405 21547 net.cpp:148] Top shape: 1 21 (21)
...
I0618 07:43:19.005576 21547 layer_factory.hpp:77] Creating layer loss

F0618 07:43:19.005725 21547 softmax_loss_layer.cpp:47] Check failed: outer_num_ * inner_num_ == bottom[1]->count() (1 vs. 21) Number of labels must match number of predictions; e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.

and padding as (n,k,1,1) gives the same error

I0618 07:48:56.920414 21684 net.cpp:141] Setting up data
I0618 07:48:56.920501 21684 net.cpp:148] Top shape: 1 3 227 227 (154587)
I0618 07:48:56.920518 21684 net.cpp:148] Top shape: 1 21 1 1 (21)
...
I0618 07:48:57.701524 21684 net.cpp:141] Setting up myfc8
I0618 07:48:57.701550 21684 net.cpp:148] Top shape: 1 21 (21)
...
I0618 07:48:57.701715 21684 layer_factory.hpp:77] Creating layer loss
F0618 07:48:57.701869 21684 softmax_loss_layer.cpp:47] Check failed: outer_num_ * inner_num_ == bottom[1]->count() (1 vs. 21) Number of labels must match number of predictions; e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.

my loss layer is the standard:

layer {
  name: "loss"
  type: "SigmoidCrossEntropy"
  bottom: "myfc8"
  bottom: "label"
  top: "loss"
}

I tried various other pads to no avail.

The error message doesn't jibe with the layer description, which wants y = [0,1], while the error message wants y=[0...C-1].

On reading the former, however, it seems the prediction shape coming from an fc layer of C outputs also needs to be reshaped to [n, k, 1, 1].

After doing this the error is gone.

The accuracy layer is (predictably) no longer functional so I took it out, and the 'MultiLabelAccuracy' layer, which I found referenced somewhere, does not seem to be implemented in the main branch.
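
In the absence of such a layer, accuracy can be measured outside the net on the thresholded outputs. The plain-Python sketch below shows two common choices (per-label accuracy and exact match); the function names are hypothetical, and this is not the 'MultiLabelAccuracy' layer mentioned above.

```python
def per_label_accuracy(predictions, targets):
    """Fraction of individual 0/1 label decisions that are correct."""
    correct = total = 0
    for p_row, t_row in zip(predictions, targets):
        for p, t in zip(p_row, t_row):
            correct += int(p == t)
            total += 1
    return correct / total

def exact_match_accuracy(predictions, targets):
    """Fraction of samples whose whole label vector is predicted correctly."""
    hits = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return hits / len(targets)

predictions = [[0, 1, 0], [1, 1, 0]]  # thresholded network outputs
targets = [[0, 1, 0], [1, 0, 0]]      # many-hot ground truth
print(per_label_accuracy(predictions, targets))
print(exact_match_accuracy(predictions, targets))
```

Per-label accuracy is forgiving when most labels are inactive, while exact match is strict, so reporting both gives a fairer picture.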
