Thanks to Evan and Kai for their replies. I must have missed something obvious. The "Fully Convolutional..." paper states: "We append a 1 x 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs". That seems to say the score blob fed into SoftmaxWithLoss should be BatchSize x 21 (channels) x H x W.
I also checked the code in 'voc-fcn32s/net.py' and found:
n.score_fr = L.Convolution(n.drop7, num_output=21, kernel_size=1, pad=0,
    param=[dict(lr_mult=1, decay_mult=1), dict(lr_mult=2, decay_mult=0)])
n.upscore = L.Deconvolution(n.score_fr,
    convolution_param=dict(num_output=21, kernel_size=64, stride=32,
        bias_term=False),
    param=[dict(lr_mult=0)])
n.score = crop(n.upscore, n.data)
n.loss = L.SoftmaxWithLoss(n.score, n.label,
    loss_param=dict(normalize=False, ignore_label=255))
which also seems to say the same thing. But if the label blob has shape batchSize x 1 x H x W, then the label shape does not match the score shape.
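For what it's worth, here is a minimal NumPy sketch (toy sizes, not Caffe code) of how a per-pixel softmax loss consumes those two shapes: the 21 score channels are softmaxed per pixel, and the single-channel label is used as an integer index into them, so the shapes are not expected to match channel-for-channel:

```python
import numpy as np

N, C, H, W = 2, 21, 8, 8  # toy sizes; a real FCN uses the full image H x W

rng = np.random.default_rng(0)
score = rng.standard_normal((N, C, H, W))      # like n.score: N x 21 x H x W
label = rng.integers(0, C, size=(N, 1, H, W))  # like n.label: N x 1 x H x W, class ids

# softmax over the channel axis, independently at every pixel
e = np.exp(score - score.max(axis=1, keepdims=True))
prob = e / e.sum(axis=1, keepdims=True)

# gather the probability of the true class at every pixel
n_idx, h_idx, w_idx = np.meshgrid(
    np.arange(N), np.arange(H), np.arange(W), indexing='ij')
p_true = prob[n_idx, label[:, 0], h_idx, w_idx]  # shape N x H x W

loss = -np.log(p_true).sum()  # normalize=False sums over all pixels
print(loss)
```

So (again, as I understand Caffe's convention) batchSize x 1 x H x W integer labels paired with batchSize x 21 x H x W scores is the intended combination.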
Actually, I tried both layouts for the label blob: 1) batchSize x 1 x H x W (each entry is an integer class id), and 2) batchSize x numClasses x H x W (binary, one-hot entries). Both give me a huge number of "test net output #" lines in the log, followed by out-of-memory errors.
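As a side note on memory, the one-hot layout by itself makes the label blob far larger than the integer layout; a quick back-of-the-envelope comparison (assuming a hypothetical 500 x 375 PASCAL-sized image, uint8 integer labels vs. float32 one-hot):

```python
import numpy as np

N, C, H, W = 1, 21, 500, 375  # hypothetical PASCAL-ish image size

int_labels = np.zeros((N, 1, H, W), dtype=np.uint8)    # case 1: integer class ids
onehot     = np.zeros((N, C, H, W), dtype=np.float32)  # case 2: one-hot, float

print(int_labels.nbytes)  # 187500 bytes (~0.18 MB)
print(onehot.nbytes)      # 15750000 bytes (~15.75 MB), 84x larger
```

That alone would not explain the out-of-memory errors, but it does make the one-hot variant strictly worse on that front.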
Any insights? (I guess I can change the logging verbosity somehow, but that's a separate question.)