Training: questions


Bartosz Ludwiczuk

Nov 13, 2015, 6:26:53 AM
to CMU OpenFace
Hi Bamos,
I have several questions about the design of the OpenFace training stage.

  1. I looked at the nn4 architecture and could you explain a few things to me:
    • Why is there no normalization in the first layers, as FaceNet uses it? (I mean the comment: -- Don't use normalization)
    • Did you try removing the 5x5 kernels in nn4? I think they remove the 5x5 convolutions in the last two layers, where the input field is smaller than 5x5 (for example, the input to 5a and 5b is 3x3x1024; the input to 4e is 6x6x640). I would delete the 5x5 in the last two Inception modules.
    • The paper has no info about Batch Normalization. Do you think it matters?
  2. You released the model from epoch 177, but you trained for ~300 epochs. Is this because the final model gave lower results (it overfit)?
  3. Do you think any data augmentation technique could provide better results, like color jittering or JPEG compression?

How are the results after training with Suggestion 1?

Regards,
Bartosz


Brandon Amos

Nov 17, 2015, 11:47:26 AM
to Bartosz Ludwiczuk, CMU OpenFace
Hi Bartosz,

Thanks for the message, happy you're looking so closely at the code.

> 1. I looked at the nn4 architecture and could you explain a few things to me:
> - Why is there no normalization in the first layers, as FaceNet uses it? (I mean
> the comment: -- Don't use normalization)

The only information the FaceNet paper gives is just `norm` for these portions,
so I'm not sure what kind of normalization they're using.
Do you know?
Maybe something like Torch's SpatialContrastiveNormalization?

I opened an issue on this a month ago to remind myself to
further experiment with the normalization:

https://github.com/cmusatyalab/openface/issues/37

> - Did you try removing the 5x5 kernels in nn4? I think they remove the
> 5x5 convolutions in the last two layers, where the input field is smaller than
> 5x5 (for example, the input to 5a and 5b is 3x3x1024; the input to 4e is
> 6x6x640). I would delete the 5x5 in the last two Inception modules.

I don't have any reason for keeping these and will try removing them.

I've opened a new issue thread for training the next model
with this, the normalization, and some alignment issues at:

https://github.com/cmusatyalab/openface/issues/55

I think I'll be able to start training new models with these
in a week or 2.
Do you have any other suggestions before then?

> - The paper has no info about Batch Normalization. Do you think
> it matters?

I hope not, but haven't run extensive experiments on this.
Also added it to the issue, but I'm not sure how to evaluate this
without training a new model for 2-3+ weeks.

> 2. You released the model from epoch 177, but you trained for ~300 epochs.
> Is this because the final model gave lower results (it overfit)?

Yes, the later models seemed to overfit and didn't perform as well
on LFW.
I don't have any better numbers because I deleted the models to
clear up disk space.


> 3. Do you think any data augmentation technique could provide
> better results, like color jittering or JPEG compression?

Possibly. [Krizhevsky 2012] has a good section on this for image classification,
where they augment the colors and use random crops.
My intuition is that the random crops aren't good for face recognition
since the alignment preprocessing puts all of the features in
the same location, but color jittering might help.
When I extended the [imagenet-multiGPU.torch] example for OpenFace,
I removed the augmentation code.
It would be interesting to try.

> How are the results after training with Suggestion 1?

Thanks again for fixing this!
Training is clearly faster now, but the accuracy hasn't surpassed
my existing models.
I posted the latest results here:
https://github.com/cmusatyalab/openface/issues/48#issuecomment-157421504



-Brandon.


[imagenet-multiGPU.torch]: https://github.com/soumith/imagenet-multiGPU.torch
[Krizhevsky 2012]: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Bartosz Ludwiczuk

Nov 18, 2015, 7:30:27 AM
to CMU OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
Hi, 

On Tuesday, November 17, 2015 at 5:47:26 PM UTC+1, Brandon Amos wrote:
> Hi Bartosz,
>
> The only information the FaceNet paper gives is just `norm` for these portions,
> so I'm not sure what kind of normalization they're using.
> Do you know?
> Maybe something like Torch's SpatialContrastiveNormalization?

I think you are right. I cannot figure out any other method.


And about the training procedure:
I am now implementing a new training procedure. It is rather different from the current pipeline and works as follows (a rough Torch sketch follows the pros and cons):
1. Define only one model, not 3 parallel models (this reduces the time and memory needed for training).
2. Then, during training, do only one forward pass over all images in the batch.
3. Then, given the features, create all possible positive pairs. For each positive pair, choose a negative based on the idea from VGG-Face (so it must be within the margin, but the negative example can be closer to the anchor than the positive, which FaceNet does not allow). From all negative examples satisfying the margin, choose one randomly. So the number of triplets is much higher than the batch size (for example, batch size: 140, triplets: 700).
4. These triplets go to the TripletLoss, forward and backward.
5. Then the gradient from the TripletLoss must be spread back into the model. I do this by averaging, for each example, the gradients from every triplet it occurs in (at least 1). You need a table that stores the indices of the examples forming each triplet.
6. Then backpropagate the gradient through the model.


Pros:
  • much faster than the current implementation
  • creates as many triplets as possible, so the gradient has a better distribution (more robust to noise)
Cons:
  • we can only handle a batch size that fits in GPU memory. So we cannot create a batch of 1800, choose negative examples, and then process them sequentially. In this implementation, the number of images in a batch must equal the batch size on the GPU
  • as we do not have such a big batch, we cannot choose the best hard negative examples
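
A rough Torch sketch of steps 1-6 (not my exact code; `selectTriplets` is just a placeholder that builds all anchor-positive pairs and picks one random margin-violating negative per pair, returning LongTensors of indices; the optim update is omitted):

-- One model, one forward pass; triplets are built from the embeddings and
-- the triplet gradients are averaged back onto the embedding matrix.
require 'nn'

local function trainBatch(model, criterion, images, labels)
   -- Steps 1/2: single model, single forward pass over the whole batch.
   local embeddings = model:forward(images)              -- B x 128

   -- Step 3: all anchor-positive pairs, one random violating negative each.
   local aIdx, pIdx, nIdx = selectTriplets(embeddings, labels)  -- placeholder
   local triplet = {embeddings:index(1, aIdx),
                    embeddings:index(1, pIdx),
                    embeddings:index(1, nIdx)}

   -- Step 4: forward and backward through the triplet criterion only.
   local loss = criterion:forward(triplet)
   local gradTriplet = criterion:backward(triplet)       -- {gradA, gradP, gradN}

   -- Step 5: spread the triplet gradients back onto the B x 128 embeddings,
   -- averaging over the triplets each example occurs in.
   local gradEmbeddings = embeddings:clone():zero()
   local counts = torch.zeros(embeddings:size(1))
   local idx = {aIdx, pIdx, nIdx}
   for t = 1, 3 do
      for i = 1, aIdx:size(1) do
         local j = idx[t][i]
         gradEmbeddings[j]:add(gradTriplet[t][i])
         counts[j] = counts[j] + 1
      end
   end
   for j = 1, counts:size(1) do
      if counts[j] > 0 then gradEmbeddings[j]:div(counts[j]) end
   end

   -- Step 6: single backward pass through the model.
   model:backward(images, gradEmbeddings)
   return loss
end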


My current tests are done using hdf5 files and creating the embeddings. The results from the current pipeline and the proposed one are similar; both greatly overfit. But I think this could work with the FaceNet model. I am planning to implement and test it.

What do you think about such a pipeline?

 

Brandon Amos

Nov 18, 2015, 3:51:50 PM
to Bartosz Ludwiczuk, CMU OpenFace
> 1. Define only one model, not 3 parallel models (this reduces the time and
> memory needed for training)

This, in combination with:

> 2. Then, during training, do only one forward pass over all images in the batch

seems very reasonable and should speed up training.


> - as we do not have such a big batch, we cannot choose the best hard
> negative examples

Would it help to store the representations from the past few
mini-batches and use them to select harder negatives?
Then, when sampling a batch, you could pass the anchors and positives
through first, then select negatives by comparing them to the outdated
representations from the past few mini-batches, assuming they won't
change much.
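
Something along these lines (just a sketch of the idea, not code in OpenFace):

-- Hypothetical FIFO cache of embeddings from the last few mini-batches,
-- used to mine harder negatives from slightly stale representations.
local EmbeddingCache = {}
EmbeddingCache.__index = EmbeddingCache

function EmbeddingCache.new(maxBatches)
   return setmetatable({maxBatches = maxBatches, entries = {}}, EmbeddingCache)
end

function EmbeddingCache:add(embeddings, labels)
   -- Store detached copies; they slowly go stale as the model updates.
   table.insert(self.entries, {emb = embeddings:clone(), lab = labels:clone()})
   if #self.entries > self.maxBatches then table.remove(self.entries, 1) end
end

function EmbeddingCache:all()
   if #self.entries == 0 then return nil, nil end
   local embs, labs = {}, {}
   for _, e in ipairs(self.entries) do
      table.insert(embs, e.emb)
      table.insert(labs, e.lab)
   end
   return torch.cat(embs, 1), torch.cat(labs, 1)
end

-- In the training loop: forward the anchors and positives, pick negatives by
-- comparing them against cache:all(), then call cache:add(embeddings, labels).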


-Brandon.

bda...@vt.edu

Dec 28, 2015, 9:33:57 PM
to CMU-OpenFace, melg...@gmail.com
> > Maybe something like Torch's SpatialContrastiveNormalization?
> I think you are right. I cannot figure out any other method.

Hi, I'm posting here for completeness and added the following
comment inline in the model definitions.

I think the normalization is local response normalization (LRN),
which is not SpatialContrastiveNormalization in Torch.
The FaceNet paper just says `norm`, but says that it's based
heavily on the inception paper (http://arxiv.org/pdf/1409.4842.pdf),
which uses pooling and normalization in the same way in the early layers.

The Caffe and official versions of the inception network both use LRN,
and they define LRN to be across channels.

I'm currently using fbcunn's CrossMapNormalization for this layer,
and fbnn implements a CPU version of it.
However, fbnn's CPU execution of this layer isn't built by default and
depends on the Intel MKL, which most people (including myself) don't have.
I don't think the README makes this clear, and I've filed an issue about it.

If LRN helps the model, I plan to switch to using nn's SpatialCrossMapLRN.
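
For concreteness, this is roughly what I mean for the early layers (a sketch only; the LRN parameters are the usual GoogLeNet ones since the FaceNet paper doesn't give them):

-- Sketch of the nn4 stem with cross-channel LRN, mirroring the
-- conv -> pool -> norm ordering from the inception paper.
require 'nn'

local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 64, 7, 7, 2, 2, 3, 3))
net:add(nn.ReLU(true))
net:add(nn.SpatialMaxPooling(3, 3, 2, 2, 1, 1))
net:add(nn.SpatialCrossMapLRN(5, 0.0001, 0.75))   -- LRN across channels

net:add(nn.SpatialConvolution(64, 64, 1, 1))
net:add(nn.ReLU(true))
net:add(nn.SpatialConvolution(64, 192, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(nn.SpatialCrossMapLRN(5, 0.0001, 0.75))   -- LRN across channels
net:add(nn.SpatialMaxPooling(3, 3, 2, 2, 1, 1))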

Does this interpretation seem reasonable?

-Brandon. 

Bartosz Ludwiczuk

Dec 30, 2015, 4:13:51 AM
to CMU-OpenFace, melg...@gmail.com
Hi Brandon,
thanks for the update.
As you said, there are a lot of unclear things in the FaceNet paper, like this norm and other stuff. We will see if the norm helps.

Since last time I have been analyzing the algorithms from FaceNet and OpenFace, and I think there are more discrepancies.
Here are the places where I see differences (quotes are from the FaceNet paper):
1. "we use all anchor-positive pairs": the current pipeline uses only one possible anchor-positive combination (each anchor gets one random positive example). All possible pairs should be created here, so many more than now.
2. "In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq.1": so only triplets that violate the constraint should be used (see the short snippet after this list). Why? Because other triplets produce a "0" gradient, which then waters down the final gradient update in the model (so all chosen triplets should be within the margin; others should not be considered). What I am not sure about is merging the gradients from the triplets back into the model. I have doubts about whether the gradient for each sample should be averaged (as one sample can be used several times) or just left as is.
3. Based on both points, the current image-processing pipeline should be changed (I mean that using three copies of the model is the wrong idea). As we want to generate all possible triplets, the pipeline should be as follows:
  • one model forwards all images in one pass - the maximum number is based on memory consumption; for 4 GB it would be 140 for nn4, for 12 GB it could be ~450
  • create all possible positive pairs using the embeddings (remembering the true indices in the embedding matrix) - a batch of 10 people with 14 images each gives ~800 pairs
  • choose a random negative example that violates the triplet constraint; if a pair has no such negative example, remove it
  • go through the criterion and calculate the gradient (forward and backward pass)
  • map the gradients from the triplets back to the embeddings
  • make a backward pass through the model
       It is much faster than the current version and it can be run on a single GPU.
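
Here is the short snippet I mean for point 2 (a sketch, not the exact code):

-- A triplet (a, p, n) should only be used if it violates Eq. 1, i.e.
--   ||f(a) - f(p)||^2 + alpha > ||f(a) - f(n)||^2,
-- because satisfied triplets give a zero gradient.
local function violatesMargin(embA, embP, embN, alpha)
   local dAP = (embA - embP):pow(2):sum()
   local dAN = (embA - embN):pow(2):sum()
   return dAP + alpha > dAN
end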


Additional notes:
1. As CASIA-WebFace is much smaller than Google's data, I think that choosing only the triplets within the margin will slow down convergence. It is much better to use the idea from Oxford-Face and choose a random triplet that violates the constraint (so the negative example can be closer to the anchor than the positive).
2. The testing procedure is hard. I think there should not be something like "random triplets"; such a number does not tell us about the performance of the model. I was thinking about checking the LFW score after each epoch (as we use that metric to evaluate the model). This will clearly show whether the model is going in the right direction (Google uses verification accuracy too, but on their own data).
3. One more note from the paper, which I do not implement: "Additionally, randomly sampled negative faces are added to each mini-batch". It could boost performance, but I am not sure it should be implemented now.
4. I found a new dataset: CelebFace-A. It has 200k images and is the old CelebFace database but with additional attributes. So merging the 3 databases (CASIA, FaceScrub, and CelebFace) gives 0.8M faces.

I have implemented this idea and tested it using CASIA and the nn4 model. It reaches ~89% after 1 day of training (additionally, I use cuDNN v4 for the nn4 model on a GTX 980). So it leads to better performance.
So, I have code for this idea but it is really dirty (I am still hunting bugs and looking for ideas to get higher accuracy). I may clean it up and push it to a new repo where anybody could test it.
I think that not only the amount of data limits the accuracy, but the training algorithm too. I was thinking that maybe Oxford-Face would release their code and it would clarify the idea, but it looks like we have to do it on our own.

Brandon Amos

Dec 30, 2015, 5:44:24 PM
to Bartosz Ludwiczuk, CMU-OpenFace
Hi Bartosz,

Thanks! Great information and thoughts.

I don't want the content getting lost in the mailing list,
so I've added it to our docs/website at:
http://cmusatyalab.github.io/openface/training-new-models/#discrepancies-between-openface-and-facenet-training

> 4. I found a new dataset: CelebFace-A
> <http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html>. It has 200k images and
> is the old CelebFace database but with additional attributes. So merging the
> 3 databases (CASIA, FaceScrub, and CelebFace) gives 0.8M faces.

Great to know about.
I'll add CelebFace-A to my combined dataset for the model I'm training.

> I have implemented this idea and tested it using CASIA and the nn4 model.
> It reaches ~89% after 1 day of training (additionally, I use cuDNN v4 for
> the nn4 model on a GTX 980).

Wow! 1 day is really fast.

I originally didn't use cudnn so I wouldn't have to post-process the
model to convert back to nn layers, but this Torch mailing list post
makes it look easy:
https://groups.google.com/d/msg/torch7/i8sJYlgQPeA/wiHlPSa5-HYJ

> So, I have code for this idea but it is really dirty (I am still hunting
> bugs and looking for ideas to get higher accuracy). I may clean it up and
> push it to a new repo where anybody could test it.

I'm interested in the code; please post a link here when you're finished.

-Brandon.

Brandon Amos

Dec 30, 2015, 6:02:08 PM
to CMU-OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
Hi Bartosz,

> 4. I found a new dataset: CelebFace-A
> <http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html>. It has 200k images and
> is the old CelebFace database but with additional attributes. So merging the
> 3 databases (CASIA, FaceScrub, and CelebFace) gives 0.8M faces.

Great to know about.
I'll add CelebFace-A to my combined dataset for the model I'm training.

Where can I find information about what identity is mapped to each image in CelebFace-A?
I can't find it in the README or the .txt files.

-Brandon.

Bartosz Ludwiczuk

Dec 31, 2015, 4:01:59 AM
to CMU-OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
I did not check the data before linking it here. I thought they provided the mapping (as they write "10,177 number of identities"), but it seems they only provide attributes.
I will write to the authors and ask if they have the mapping from images to identities.

About the code: I will post it next week.

Bartosz Ludwiczuk

Dec 31, 2015, 12:00:15 PM
to CMU-OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
I got an answer from the authors of CelebFace-A. They do not provide the mapping from images to identities, but they are planning to. So we have to wait until they release the mapping; then we can merge it with CASIA and FaceScrub.

Brandon Amos

Dec 31, 2015, 2:20:08 PM
to Bartosz Ludwiczuk, CMU-OpenFace
> I got an answer from the authors of CelebFace-A. They do not provide the
> mapping from images to identities, but they are planning to. So we have to
> wait until they release the mapping; then we can merge it with CASIA and FaceScrub.

I see, hopefully the identities will be compatible with CASIA-WebFace
and FaceScrub.
I've added an automatic monitor to their website and will
post here when the identity info has been added.

-Brandon.

Bartosz Ludwiczuk

Jan 4, 2016, 9:32:45 AM
to CMU-OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
Hi,
here is my version of training code for OpenFace: https://github.com/melgor/Triplet-Learning
The files that differ from the original have the suffix "_fast". Additionally, in "OpenFaceOptim" there is a method with the suffix "_fast".
Everything should work like the previous version, but faster and with better results (even more than 90%).
Currently there is no test phase. I implemented LFW testing after each epoch but it is too "hacky". I will try to rewrite it and then push that code too.
This is not exactly the version I am using, so if anyone has problems running it, let me know.


Main changes:
  • add cudnn modules for the networks
  • use the "margin" idea from Oxford (because of the smaller batch and less data)
  • speed up choosing triplets
  • speed up learning ~3x by doing only one F/B pass of the model (the original OpenFace has 4)
  • now batchSize = opt.peoplePerBatch * opt.imagesPerPerson
  • in trainHook I added some data augmentation techniques, but I am not sure they are correct (they do boost the accuracy by a small amount); see the sketch just below this list
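
The augmentations are along these lines (a sketch; I am not sure these are the right transforms for aligned faces):

-- Simple trainHook-style augmentation: random horizontal flip plus
-- brightness and per-channel color jitter (values are just a guess).
require 'image'

local function augment(img)                  -- img: 3 x H x W in [0, 1]
   img = img:clone()
   if torch.uniform() < 0.5 then
      img = image.hflip(img)                 -- random horizontal flip
   end
   img:mul(torch.uniform(0.8, 1.2))          -- global brightness jitter
   for c = 1, 3 do
      img[c]:mul(torch.uniform(0.9, 1.1))    -- per-channel color jitter
   end
   return img:clamp(0, 1)
end
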
Additional dependencies:
Plans:
  • find any bugs or causes of the low accuracy
  • add more options for the optimization (regimes etc.) and save them
  • plot results like in the CIFAR example

Bartosz Ludwiczuk

Jan 6, 2016, 3:17:07 PM
to CMU-OpenFace
Hi,
as I still want to get as high an accuracy as possible, I found a very interesting paper: Embedding Label Structures for Fine-Grained Feature Representation. Their analysis shows:
1. Raw triplet learning does not work so well. It needs more data or more time (which is what Google's FaceNet has).
2. It is much better to first train the net on classification, then fine-tune it with triplet learning.
3. The best option is to learn both classification and the embedding jointly.

I will try to run some experiments with option "2". I do not know what is wrong with the current implementation of the triplet loss (why the accuracy is so low using the nn4 model). Maybe it is time to try other improvements?
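
Roughly what I have in mind for option "2" (a sketch; `trunk` stands for the nn4 network up to the 128-d embedding):

-- Phase 1: train with a classification head on identities.
require 'nn'

local nClasses = 10575                       -- e.g. CASIA-WebFace identities
local classifier = nn.Sequential()
classifier:add(trunk)                        -- nn4 up to the 128-d embedding
classifier:add(nn.Linear(128, nClasses))
classifier:add(nn.LogSoftMax())
local clsCriterion = nn.ClassNLLCriterion()
-- ... train `classifier` on (image, identity) pairs ...

-- Phase 2: drop the head and fine-tune the embedding with the triplet loss,
-- using the pipeline discussed earlier in this thread.
local embeddingNet = classifier:get(1)       -- the pretrained trunk
-- ... continue training `embeddingNet` with the triplet criterion ...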


 

Brandon Amos

Jan 6, 2016, 4:01:14 PM
to Bartosz Ludwiczuk, CMU-OpenFace
Hi Bartosz,

> Embedding Label Structures for Fine-Grained Feature Representation
> <http://arxiv.org/abs/1512.02895>. Their analysis shows:

Interesting!

> 2. It is much better to first train the net on classification, then fine-tune
> it with triplet learning

This seems consistent with the bootstrapping from the VGG-Face paper.

> I will try to run some experiments with option "2"

This seems reasonable to implement.
If you haven't already seen it, Soumith's ImageNet example code
is a good example of training classification and is what I based
the OpenFace training code on:
https://github.com/soumith/imagenet-multiGPU.torch

> why the accuracy is so low using the nn4 model

By low accuracy here, I assume you're referring to the ~90% accuracies
your improvements to the existing training provide and you want to
target ~99% accuracy.

-Brandon.

Bartosz Ludwiczuk

Jan 7, 2016, 2:14:37 AM
to CMU-OpenFace, melg...@gmail.com, ba...@cs.cmu.edu
Hi,
yes, I want to get >97%; 99% would be nice :) I will be using Soumith's ImageNet example.

I see that you merged the training code into the master branch; did you check if it works? Or did you get any results?

Brandon Amos

Jan 7, 2016, 2:36:24 PM
to Bartosz Ludwiczuk, CMU-OpenFace
> I see that you merged the training code into the master branch; did you
> check if it works? Or did you get any results?

Yes, the training code with your improvements is working well!
A new model I trained with it gets 91.53% LFW accuracy.

-Brandon.

Bartosz Ludwiczuk

Jan 7, 2016, 3:13:32 PM
to Brandon Amos, CMU-OpenFace
Nice!
As I understand it, you used the model from the current version of the repo. I mean: no 5x5 convolutions in the last Inception modules and SpatialCrossMapLRN, right?
I have not tested this model, which is why I am asking.

Brandon Amos

Jan 7, 2016, 3:17:55 PM
to Bartosz Ludwiczuk, CMU-OpenFace
> As I understand it, you used the model from the current version of the repo.
> I mean: no 5x5 convolutions in the last Inception modules and
> SpatialCrossMapLRN, right?
> I have not tested this model, which is why I am asking.

Correct, and I didn't experiment with the impact of using them vs.
my previous model without them.

glasses cat

Mar 31, 2016, 3:59:38 AM
to CMU-OpenFace
Hi,

I'm very interested in your ideas about items 2 and 3; any progress on them?

Best,
Richard

On Thursday, January 7, 2016 at 4:17:07 AM UTC+8, Bartosz Ludwiczuk wrote: