Clustering faces with FC7 features


Edmond

Oct 28, 2015, 2:53:19 AM
to Caffe Users
Hi, I have a collection of faces (thousands of them) that I would like to group together automatically.

My approach is to first extract facial features (FC7 layer of the VGG face descriptor network), then group the faces with a clustering method such as k-means.
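Concretely, my pipeline looks roughly like this (a minimal sketch: the prototxt/caffemodel file names, the cluster count, and the face_paths list are placeholders, and mean subtraction is omitted for brevity):

import numpy as np
import caffe
from sklearn.cluster import KMeans

net = caffe.Net('VGG_FACE_deploy.prototxt', 'VGG_FACE.caffemodel', caffe.TEST)

def fc7_feature(path):
    img = caffe.io.load_image(path)                     # RGB, HxWx3, values in [0, 1]
    img = caffe.io.resize_image(img, (224, 224))        # the net's input size
    blob = img[:, :, ::-1].transpose(2, 0, 1) * 255.0   # to BGR, CxHxW, 0-255 (mean subtraction omitted)
    net.blobs['data'].data[0] = blob
    net.forward()
    return net.blobs['fc7'].data[0].copy()              # 4096-d FC7 descriptor

feats = np.array([fc7_feature(p) for p in face_paths])  # face_paths: list of face image files
labels = KMeans(n_clusters=50).fit_predict(feats)       # cluster count is a guess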

However, results so far aren't promising, i.e. too many unrelated faces are grouped together.

My questions are:

1) Are FC7 layer features suitable for face clustering?

2) What distance metric should I use? I tried all the standard approaches such as Euclidean, cosine, city block, etc., with average results.

Any help would be much appreciated.

ath...@ualberta.ca

Oct 28, 2015, 12:41:08 PM
to Caffe Users
Hi Edmond,

If you extract image representations (reps) from convolutional layers, they inherently contain spatial information. Each fully connected layer, by contrast, scatters (removes) that spatial information.

You can think of the convolutional rep responses as being "tuned into" spatial comparisons across images, in addition to plain "existence" responses: not just whether an eye appears, or the color or shape of an eye, but the position of the eye. Fully connected reps will have a great deal of this spatial information removed but will contain more abstraction, so it's a trade-off.

If your images (faces) are somewhat registered (eyes, mouth, etc. are at about the same locations across images), then you will find the convolutional ones perform much better for you. For the AlexNet architecture, as an example, I highly recommend the pool5 layer.

In general, try the last convolutional layer rep (after pooling) if your images are registered.
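With pycaffe that is just a different blob name. A minimal sketch, assuming the same net object and preprocessing as in your FC7 snippet (the 256x6x6 shape is AlexNet's pool5; VGG's pool5 is 512x7x7):

import numpy as np

net.forward()                        # same preprocessing/forward pass as before
pool5 = net.blobs['pool5'].data[0]   # (256, 6, 6) for AlexNet
feat = pool5.reshape(-1)             # flatten to one 9216-d vector, spatial layout preserved
feat = feat / np.linalg.norm(feat)   # L2-normalize before comparing across images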

Fully connected layer reps will perform better the less registration you have in an image (like crops of cats, where the cat can be in any pose and cat parts can be anywhere in the cropped image). In this case the rep benefits from the increased abstraction and increased spatial invariance.

Hope this helps.

Regards,
Andy Hess

ath...@ualberta.ca

Oct 28, 2015, 12:45:08 PM
to Caffe Users
Regarding metrics, I have found (and have read in many papers) that cosine distance performs at least as well as anything else (and it's nice and fast). Your representation search is far more important than your metric search, so I would just stick with cosine distance and put the effort into finding the best representation.
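On L2-normalized features it also reduces to a single matrix product. A sketch, assuming a feats matrix with one representation per row:

import numpy as np

feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-length rows
cos_dist = 1.0 - np.dot(feats, feats.T)                       # pairwise cosine distances

A side benefit: between unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so running plain k-means on the normalized features effectively clusters by cosine.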

Edmond

Oct 29, 2015, 2:00:47 AM
to Caffe Users
Hi Andy,

Thank you so much for taking the time to reply to my post. Your posts are insightful and I will definitely try out pool5 features!

Regarding my problem, I found this:

Oxford's VGG Face Descriptor ... Their softmax model doesn't embed features like FaceNet, which makes tasks like classification and clustering more difficult. 

Interesting.

Thanks,
Edmond

ath...@ualberta.ca

Oct 31, 2015, 1:42:02 PM
to Caffe Users
Yes, this is true. FaceNet is a spectacular, state-of-the-art paper - a must-read for anyone in deep learning.
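Its core idea is training an embedding directly with a triplet loss. A toy numpy version of that loss (0.2 is the margin used in the paper; this is an illustration, not their implementation):

import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # anchor and positive share an identity, negative does not;
    # all three are L2-normalized embedding vectors.
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(0.0, pos_dist - neg_dist + alpha)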

However, it is also an example of how industry and academia are pulling away from each other in deep learning. In other words, if you are a grad student, there is generally no way you will get access to the quantity/quality of training data they have, not to mention the computing resources needed to reproduce these results (never mind beating them). Interning at Google/Facebook would work, though.

These embeddings are the way of the future and, IMO, far superior to extracting representations from nets as discussed above. "Gardening" for representations - training a net with some data for a given task and then picking out internal reps (and throwing them into one-vs-all SVMs or whatever) - has been the standard way to go over the last couple of years.

Lacking sufficient training data (as you do), someone else has grown the garden (pre-trained the net) and you just pick out the representations that work best. For example, on the dogs vs. cats dataset (Kaggle) this simple approach reaches 97% or so, which is still very effective. If you think about how the AlexNet feature garden was grown (a classification task over 1000 classes), then of course you cannot expect it to do anywhere near as well as FaceNet (which learns embeddings directly). But it's a good exercise in any case for a CNN course or for learning.
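Schematically the recipe looks like this - a sketch with scikit-learn, where train_feats/test_feats stand for representations pulled from a pre-trained net (e.g. the pool5 vectors above) and train_labels/test_labels are class ids:

from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0)                     # linear SVM, one-vs-rest for multiclass
clf.fit(train_feats, train_labels)         # rows of net representations + class ids
print(clf.score(test_feats, test_labels))  # held-out accuracy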

mrgloom

Nov 27, 2015, 10:58:05 AM
to Caffe Users
Did you try t-SNE to visualize your data?
http://cs.stanford.edu/people/karpathy/cnnembed/
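For example, with scikit-learn and matplotlib (assuming the feats matrix and k-means labels from earlier in the thread):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

xy = TSNE(n_components=2, perplexity=30).fit_transform(feats)  # project to 2-D
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5)                 # color by cluster
plt.show()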



Mahfujur Rahman

Aug 19, 2017, 12:46:02 AM
to Caffe Users
@mrgloom Did you find a way to visualize your data using t-SNE?

