LFW Face Recognition in Lasagne - Overfitting?

Raj Shah

unread,
Mar 9, 2017, 10:50:29 AM3/9/17
to lasagne-users
Hi guys,

I began by implementing and extending Daniel Nouri's facial keypoints detection tutorial to a face recognition task on the LFW dataset. I know the dataset is pretty small (I've also reduced it to have at least 4 photos per person), but the goal of my project is to see what results I can obtain by training on a small dataset such as LFW.

The results so far have been disappointing. I believe my net is overfitting - the train loss is almost 0 while the validation loss is high and climbing. What can I do to tackle this?

I've experimented with different network definitions (shallower ones as well) with no luck. It leads me to believe that maybe something's wrong with my dataset, and/or how I'm presenting it to my net.

Current network architecture: 8 convolution layers (with nonlinearities.rectify), each followed by a pooling layer, then 3 fully connected layers (acting as an MLP classifier), topped off with a softmax nonlinearity in my output layer. I have also included dropout, but no luck.
Alongside this, I have data augmentation processes - shuffling and flipping images - to tackle possible overfitting, but again, no luck.
I run it for 1000 epochs with early stopping (as in dnouri's tutorial).
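To illustrate, the flip augmentation on its own looks like this in plain NumPy (flip_half here is just an illustrative helper, not part of my pipeline; shapes assumed to match my (batch, 1, height, width) layout):

```python
import numpy as np

def flip_half(Xb, rng=np.random):
    """Horizontally flip a random half of the images in a batch.

    Xb is assumed to have shape (batch, channels, height, width),
    matching the (None, 1, 180, 180) input layer.
    """
    bs = Xb.shape[0]
    indices = rng.choice(bs, bs // 2, replace=False)
    Xb = Xb.copy()
    Xb[indices] = Xb[indices, :, :, ::-1]  # reverse the width axis
    return Xb

# tiny demo: a batch of two 2x2 grayscale "images"
batch = np.arange(8, dtype=np.float32).reshape(2, 1, 2, 2)
flipped = flip_half(batch)
print(flipped.shape)  # (2, 1, 2, 2) - same shape, one image mirrored
```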

Graphs/curves generated, see here. (from a previous architecture I tried implementing; curves are very similar, if not the same).

You can see all of this in the code below.

import theano
import os
import numpy as np 
from PIL import Image
import matplotlib.pyplot as plt

from lasagne import layers, nonlinearities
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet
from sklearn.datasets import fetch_lfw_people
from sklearn.utils import shuffle
from nolearn.lasagne import BatchIterator

try:
    from lasagne.layers.cuda_convnet import Conv2DCCLayer as Conv2DLayer
    from lasagne.layers.cuda_convnet import MaxPool2DCCLayer as MaxPool2DLayer
except ImportError:
    Conv2DLayer = layers.Conv2DLayer
    MaxPool2DLayer = layers.MaxPool2DLayer

# reshape dataset/images and scale pixel values to [0, 1]
def reshapeDataset(dataset, newWidth, newHeight):
    new_dataset = []
    size = (newWidth, newHeight)
    for data in dataset:
        img = Image.fromarray(data)
        img = img.resize(size)
        img = np.array(img, dtype=np.float32)
        new_dataset.append(img[np.newaxis, :, :])
    return np.array(new_dataset) / 255

class PyroBatchIterator(BatchIterator):
    # override transform() so nolearn applies the augmentation to each batch
    def transform(self, Xb, yb):
        Xb, yb = super(PyroBatchIterator, self).transform(Xb, yb)
        # flip half of the images in this batch at random:
        bs = Xb.shape[0]
        indices = np.random.choice(bs, bs // 2, replace=False)
        Xb = Xb.copy()
        Xb[indices] = Xb[indices, :, :, ::-1]  # horizontal flip (width axis)
        return Xb, yb

class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = nn.get_all_params_values()
        elif self.best_valid_epoch + self.patience < current_epoch:
            print("Early stopping.")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_params_from(self.best_weights)
            raise StopIteration()



# acquire dataset
lfw_people = fetch_lfw_people(funneled=True, min_faces_per_person=4, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the image data directly (relative pixel
# position info is ignored by this model)
X = reshapeDataset(lfw_people.images, 180, 180)
h, w = 180, 180

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
nb_class = len(target_names)

#X = np.asarray(X)
y = np.asarray(y, dtype=np.int32)

X, y = shuffle(X, y, random_state = 42)

# print dataset info
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_classes: %d" % n_classes)

# print shape of X and y presented to net
print(X.shape)
print(y.shape)

net3 = NeuralNet(
    layers = [
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('conv3', layers.Conv2DLayer),
        ('pool3', layers.MaxPool2DLayer),
        ('conv4', layers.Conv2DLayer),
        ('pool4', layers.MaxPool2DLayer),
        ('conv5', layers.Conv2DLayer),
        ('pool5', layers.MaxPool2DLayer),
        ('conv6', layers.Conv2DLayer),
        ('pool6', layers.MaxPool2DLayer),
        ('conv7', layers.Conv2DLayer),
        ('pool7', layers.MaxPool2DLayer),
        ('dropout1', layers.DropoutLayer),
        ('conv8', layers.Conv2DLayer),
        ('pool8', layers.MaxPool2DLayer),
        ('dropout2', layers.DropoutLayer),
        ('hidden9', layers.DenseLayer),
        ('dropout3', layers.DropoutLayer),
        ('hidden10', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    # layer params
    input_shape = (None, 1, 180, 180),
    conv1_num_filters = 64, conv1_filter_size = (3, 3), conv1_stride = 1, conv1_pad = 1,
    conv1_nonlinearity = nonlinearities.rectify,
    pool1_pool_size = (2, 2),
    conv2_num_filters = 64, conv2_filter_size = (3, 3), conv2_stride = 1, conv2_pad = 1,
    conv2_nonlinearity = nonlinearities.rectify,
    pool2_pool_size = (2, 2),

    conv3_num_filters = 128, conv3_filter_size = (3, 3), conv3_stride = 1, conv3_pad = 1,
    conv3_nonlinearity = nonlinearities.rectify,
    pool3_pool_size = (2, 2),
    conv4_num_filters = 128, conv4_filter_size = (3, 3), conv4_stride = 1, conv4_pad = 1,
    conv4_nonlinearity = nonlinearities.rectify,
    pool4_pool_size = (2, 2),
    conv5_num_filters = 256, conv5_filter_size = (3, 3), conv5_stride = 1, conv5_pad = 1,
    conv5_nonlinearity = nonlinearities.rectify,
    pool5_pool_size = (2, 2),

    conv6_num_filters = 256, conv6_filter_size = (3, 3), conv6_stride = 1, conv6_pad = 1,
    conv6_nonlinearity = nonlinearities.rectify,
    pool6_pool_size = (2, 2),

    conv7_num_filters = 256, conv7_filter_size = (2, 2), conv7_stride = 1, conv7_pad = 1,
    conv7_nonlinearity = nonlinearities.rectify,
    pool7_pool_size = (2, 2),
    dropout1_p = 0.5,

    conv8_num_filters = 512, conv8_filter_size = (2, 2), conv8_stride = 1, conv8_pad = 1,
    conv8_nonlinearity = nonlinearities.rectify,
    pool8_pool_size = (2, 2),
    dropout2_p = 0.5,

    hidden9_num_units = 500,
    dropout3_p = 0.5,
    hidden10_num_units = 500,

    output_num_units = 610, output_nonlinearity = nonlinearities.softmax,
    #update=nesterov_momentum,
    update_learning_rate = 0.001,
    update_momentum = 0.9,

    regression = False, # not dealing with a regression problem
    batch_iterator_train = PyroBatchIterator(batch_size = 50),

    on_epoch_finished = [
        EarlyStopping(patience=200),
    ],

    max_epochs = 1000, # want to train this many epochs
    verbose = 1,
    eval_size = 0.2,
)

net3.fit(X, y)

# test/plot curves
train_loss = np.array([i["train_loss"] for i in net3.train_history_])
valid_loss = np.array([i["valid_loss"] for i in net3.train_history_])
plt.plot(train_loss, linewidth=3, label="train")
plt.plot(valid_loss, linewidth=3, label="valid")
plt.grid()
plt.legend()
plt.xlabel("epoch")
plt.ylabel("loss")
plt.yscale("log")
plt.show()

# training for 1000 epochs will take a while.
# pickle the trained model so that we can load it back later:
#import cPickle as pickle
#with open('net3.pickle', 'wb') as f:
#    pickle.dump(net3, f, -1)

Could someone guide me in the right direction? Should I be worried that the training and validation losses begin at 6+? Please let me know if you would like to see the terminal output once training begins.
All help much appreciated!
Thank you!

Raj Shah

unread,
Mar 9, 2017, 11:06:24 AM3/9/17
to lasagne-users
I began by implementing and extending Daniel Nouri's facial keypoints detection tutorial to a face recognition task on the LFW dataset. I know the dataset is pretty small (I've also reduced it to have at least 4 photos per person), but the goal of my project is to see what results I can obtain by training on a small dataset such as LFW.

For those not familiar with the tutorial mentioned: Facial Keypoints Detection Tutorial

Jan Schlüter

unread,
Mar 9, 2017, 11:23:18 AM3/9/17
to lasagne-users
Did you try replicating Daniel's results 1:1 (original dataset and architecture)? I'd suggest starting there and taking small steps, not changing too much at once.

Raj Shah

unread,
Mar 9, 2017, 11:34:06 AM3/9/17
to lasagne-users
Hi Jan, thank you so much for replying! 

Yes, I did, with similar results, if not the same. I can also confirm there are no problems with the environment (Ubuntu, with a GTX 970).
I did, however, also try to implement his network architecture(s) with my dataset (with appropriate changes), but no luck there either.

Do you reckon there's a problem with how I present my data + augmentation?

Jan Schlüter

unread,
Mar 9, 2017, 2:09:53 PM3/9/17
to lasagne-users
However, I did try to implement his network architecture(s) with my dataset (with appropriate changes), but no luck there either.

Do you reckon there's a problem with how I present my data + augmentation?

Could be, but I guess it's more probable that the dataset is too small / too difficult / too different to be learned with the same architecture and hyperparameters. Keypoint detection and face recognition/identification aren't that closely related. Have a look at some face recognition papers and build from there, possibly try replicating some results. Could be a nice contribution to https://github.com/Lasagne/Recipes as well, although https://github.com/Kadenze/siamese_net also already gives some example implementations.
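For a flavor of the siamese approach, here is a minimal NumPy sketch of the standard margin-based contrastive loss such networks typically minimize (the function name and margin value are illustrative, not taken from the linked repository):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Margin-based contrastive loss for a siamese network.

    emb_a, emb_b: (n_pairs, dim) embeddings from the two branches
    same: (n_pairs,) 1 if the pair shows the same person, else 0
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)  # Euclidean distance per pair
    # same-person pairs are pulled together; different-person pairs are
    # pushed apart until they are at least `margin` away
    loss = same * d**2 + (1 - same) * np.maximum(0.0, margin - d)**2
    return loss.mean()

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])
same = np.array([1, 0])
print(contrastive_loss(a, b, same))  # small: both pairs already well placed
```

The point is that the network learns a face embedding that generalizes to people never seen in training, instead of a fixed 158-way classifier.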

Best, Jan

Raj Shah

unread,
Mar 10, 2017, 2:54:50 PM3/10/17
to lasagne-users
Hi Jan, thanks for your help. I have taken your advice and settled on this simpler architecture. I was hoping you, or anyone else here, could help me interpret the results? 

Like before, data augmentation is involved (flipping and shuffling); however, the dataset is now reduced to a minimum of 10 faces per person (total samples = 4324, classes = 158), and the images are resized to 64x64.
Training with batch size = 128.

net architecture: 
net3 = NeuralNet(
    layers = [
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('conv3', layers.Conv2DLayer),
        ('conv4', layers.Conv2DLayer),
        ('pool4', layers.MaxPool2DLayer),
        ('conv5', layers.Conv2DLayer),
        ('pool5', layers.MaxPool2DLayer),
        ('hidden9', layers.DenseLayer),
        ('dropout5', layers.DropoutLayer),
        ('hidden10', layers.DenseLayer),
        ('dropout6', layers.DropoutLayer),
        ('hidden11', layers.DenseLayer),
        ('dropout7', layers.DropoutLayer),
        ('output', layers.DenseLayer),
    ],
    # layer params
    input_shape = (None, 1, 64, 64),
    conv1_num_filters = 128, conv1_filter_size = (3, 3), conv1_stride = 1, conv1_pad = 1,
    conv1_nonlinearity = nonlinearities.rectify,
    conv2_num_filters = 128, conv2_filter_size = (3, 3), conv2_stride = 1, conv2_pad = 1,
    conv2_nonlinearity = nonlinearities.rectify,
    pool2_pool_size = (2, 2), pool2_stride = 2,

    conv3_num_filters = 128, conv3_filter_size = (3, 3), conv3_stride = 1, conv3_pad = 1,
    conv3_nonlinearity = nonlinearities.rectify,
    conv4_num_filters = 128, conv4_filter_size = (3, 3), conv4_stride = 1, conv4_pad = 1,
    conv4_nonlinearity = nonlinearities.rectify,
    pool4_pool_size = (2, 2), pool4_stride = 1,

    conv5_num_filters = 128, conv5_filter_size = (3, 3), conv5_stride = 1, conv5_pad = 1,
    conv5_nonlinearity = nonlinearities.rectify,
    pool5_pool_size = (2, 2), pool5_stride = 1,

    hidden9_num_units = 400,
    dropout5_p = 0.5,
    hidden10_num_units = 400,
    dropout6_p = 0.5,
    hidden11_num_units = 200,
    dropout7_p = 0.5,

    output_num_units = 158, output_nonlinearity = nonlinearities.softmax,
    # float32() and AdjustVariable are the helpers from dnouri's tutorial
    update_learning_rate = theano.shared(float32(0.01)),
    update_momentum = theano.shared(float32(0.9)),

    regression = False, # not dealing with a regression problem
    batch_iterator_train = PyroBatchIterator(batch_size = 128),

    on_epoch_finished = [
        AdjustVariable('update_learning_rate', start=0.01, stop=0.0001),
        AdjustVariable('update_momentum', start=0.9, stop=0.999),
        EarlyStopping(patience=200),
    ],

    max_epochs = 1000, # want to train this many epochs
    verbose = 1,
    eval_size = 0.2,
)

Training beginning, ending, and generated curves here.

Is it correct to say I am still overfitting? Or am I actually achieving 70% validation accuracy? Should I be concerned by the fact that the losses begin at high numbers, and are these poor results?

Like you said, I know I'm working with a small/limited dataset (especially after the restrictions above), but my objective is to see what I can get out of training on it. As this is my first CNN project after MNIST, I'm not sure whether these results are reasonable, and whether there is room for improvement?

Thanks!

Raj Shah

unread,
Mar 12, 2017, 11:35:35 AM3/12/17
to lasagne-users
Hi guys,

This might be a dumb question, but how do I calculate my training and validation error from the training and validation loss numbers generated? (The link above has both values.) I ask so that I can make more sense of the results above and answer questions like: how do I know when to stop? The loss function by default is categorical cross-entropy, and I just want to look at percentages and make sense of the validation accuracy.

Thanks!

Jan Schlüter

unread,
Mar 13, 2017, 7:33:59 AM3/13/17
to lasagne-users
Is it correct to say I am still overfitting? Or am I actually achieving 70% validation accuracy?

Both. You're overfitting since the training error still goes down by a lot when the validation error stagnates. But as long as your validation error doesn't go up, you're just wasting time and energy, not really harming the model (there might be room for improvement through better regularization). And yes, you achieve 70% validation accuracy.

Should I be concerned by the fact that the losses begin at high numbers,

Not really, as long as they go down. If they stay high you may have to change hyperparameters (better-scaled weight initialization, a smaller learning rate, or duck out by using batch normalization).
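As a quick sanity check: an untrained softmax classifier predicts roughly uniformly, so the categorical cross-entropy starts at about -ln(1/n) = ln(n) for n classes. For your 610-class run that is about 6.4, which is exactly the "losses begin at 6+" you observed:

```python
import numpy as np

# expected initial cross-entropy for an uninformed softmax classifier:
# uniform predictions p = 1/n over n classes give a loss of ln(n)
for n_classes in (158, 610):
    print("%d classes -> initial loss ~ %.2f" % (n_classes, np.log(n_classes)))
```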
 
and are these poor results?

Compare to some papers.

Like you said, I know I'm working with a small/limited dataset (especially after restrictions above), but my objective is to see what I could get out of training on it. As this is my first CNN project after MNIST, I'm not sure whether this is correct or incorrect, and whether there is room for improvement?

Well, the way you train, it will probably only be able to distinguish those 158 people you trained on. Have a look at the repository I linked for a different approach.

This might be a dumb question, but how do I calculate my training and validation error from the training and validation loss numbers generated?

You can't compute the error from the cross-entropy loss. You can just compute the classification error from the accuracy (error = 1 - accuracy). The cross-entropy loss is a differentiable proxy for the classification error, which can't be minimized directly since it's not differentiable. Please have a look at the literature suggested at http://lasagne.readthedocs.io/en/latest/user/tutorial.html#before-we-start, I cannot teach deep learning over a mailing list :)
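To make the distinction concrete, here is a small NumPy example computing both numbers from the same predictions (a toy metrics helper, not nolearn's implementation):

```python
import numpy as np

def metrics(probs, targets):
    """Cross-entropy loss vs. classification error for softmax outputs.

    probs: (n_samples, n_classes) predicted class probabilities
    targets: (n_samples,) integer class labels
    """
    n = len(targets)
    # categorical cross-entropy: mean negative log-probability of the true class
    loss = -np.mean(np.log(probs[np.arange(n), targets]))
    # classification error: fraction of wrong argmax predictions (= 1 - accuracy)
    error = np.mean(np.argmax(probs, axis=1) != targets)
    return loss, error

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.5, 0.4],
                  [0.4, 0.4, 0.2]])
targets = np.array([0, 1, 2])
loss, error = metrics(probs, targets)
# the third sample is misclassified (argmax picks class 0), so error = 1/3,
# while the loss also reflects how confident the correct predictions were
print(round(loss, 3), round(error, 3))
```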

Best, Jan