Why does the test accuracy differ between Caffe and my own testing?

5,277 views

peng yu

unread,
Oct 14, 2014, 3:53:24 AM10/14/14
to caffe...@googlegroups.com
Hi,

I have tried the MNIST and CIFAR-10 datasets, using Caffe to train models with 99% and 75% accuracy respectively.

Then I load the test images (from the test dataset) with Python and use pycaffe to predict and evaluate the accuracy, but I get a noticeably lower result than 99% or 75%. What could the problem be?

peng yu

unread,
Oct 14, 2014, 11:50:39 PM10/14/14
to caffe...@googlegroups.com
I have pasted the Python file I used to test the accuracy myself.

https://dpaste.de/4uC3
-------------
import numpy as np
import matplotlib.pyplot as plt
import sys

# Make sure that caffe is on the python path:
caffe_root = '/home/work/caffe/'  # this file is expected to be in {caffe_root}/examples
sys.path.insert(0, caffe_root + 'python')
import caffe
from caffe.proto import caffe_pb2
from caffe.io import blobproto_to_array

MODEL_FILE = '/home/work/caffe/examples/cifar10/cifar10_quick.prototxt'
PRETRAINED = '/home/work/caffe/examples/cifar10/cifar10_quick_iter_5000.caffemodel'
MEAN_FILE= '/home/work/caffe/examples/cifar10/mean.binaryproto'
TEST_FILE = '/home/work/Downloads/cifar-10-batches-py/test_batch'
META_FILE = '/home/work/Downloads/cifar-10-batches-py/batches.meta'
TRAIN_FILE = '/home/work/Downloads/cifar-10-batches-py/data_batch_1'


def unpickle(file_name):
    import cPickle
    fo = open(file_name, 'rb')
    mdict = cPickle.load(fo)
    fo.close()
    return mdict

def load_mean(mean_file=MEAN_FILE):
    blob = caffe_pb2.BlobProto()
    data = open(mean_file, "rb").read()
    blob.ParseFromString(data)
    nparray = blobproto_to_array(blob)
    return nparray[0]

def main():
    train = unpickle(TRAIN_FILE)
    test = unpickle(TEST_FILE)
    meta = unpickle(META_FILE)
    train_data = train['data']
    train_label = train['labels']
    test_data = test['data']
    test_label = test['labels']
    test_data = map(lambda x: x.reshape((32, 32, 3)), test_data)
    train_data = map(lambda x: x.reshape((32, 32, 3)), train_data)

    net = caffe.Classifier(MODEL_FILE, PRETRAINED,
                           mean=load_mean(),
                           channel_swap=(2,1,0),
                           raw_scale=255,
                           image_dims=(32, 32))
   
    # evaluate on the test set, matching the labels loaded above
    total, accu = 0, 0
    for i, j in zip(test_data, test_label):
        total += 1
        res = net.predict([i])
        if res[0].argmax() == j:
            accu += 1
        print 'Already run: %s, accuracy: %s' % (total, float(accu) / total)


if __name__ == "__main__":
    main()
----------

The outcome is really not good...


On Tuesday, October 14, 2014 at 3:53:24 PM UTC+8, peng yu wrote:

David Chik

unread,
Oct 15, 2014, 1:10:05 AM10/15/14
to caffe...@googlegroups.com
This is a normal thing. At runtime Caffe only takes a sample to calculate the accuracy, so it may not be very reliable. To make it more reliable, you need to increase the test batch size and also shuffle your data before feeding it to Caffe.

Evan Shelhamer

unread,
Oct 15, 2014, 1:27:08 AM10/15/14
to David Chik, caffe...@googlegroups.com
Peng Yu: differences between train/val and deploy net performance are almost always due to differences in preprocessing. The configuration must be carefully matched. Note the format details on http://www.cs.toronto.edu/~kriz/cifar.html and double-check that your configuration of the pycaffe input preprocessing and unpickling of cifar-10 match the db creation and data layer. One potential suspect: I believe the cifar-10 data is RGB so you should not be doing channel swapping.

David Chik: no, that's not how it works. The test net in the Caffe examples is configured to deterministically run batches over the test set and average the accuracy and loss. There is no sampling, and order has absolutely no effect, since each instance within a batch is independently predicted. Mini-batch sampling and shuffling are training-time considerations only.
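For reference, the CIFAR page Evan links documents that each pickled row is 3072 values in channel-major order: 1024 red, then 1024 green, then 1024 blue. A minimal NumPy sketch of a reshape that matches that layout (the helper name is mine, not from the thread) might look like:

```python
import numpy as np

def cifar_row_to_hwc(row):
    """Reshape one CIFAR-10 pickled row (3072 values, channel-major:
    1024 red, then 1024 green, then 1024 blue) into a 32x32x3 HWC image."""
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# A plain row.reshape((32, 32, 3)) would scramble rows and channels instead.
row = np.arange(3072)
img = cifar_row_to_hwc(row)
assert img.shape == (32, 32, 3)
# pixel (0, 0) takes one value from each 1024-element channel plane
assert list(img[0, 0]) == [0, 1024, 2048]
```

And since the pickled data is already RGB, per Evan's point no `channel_swap` should be applied when the model was trained on data in the same channel order.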

Evan Shelhamer

--
You received this message because you are subscribed to the Google Groups "Caffe Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to caffe-users...@googlegroups.com.
To post to this group, send email to caffe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/caffe-users/7fb1d949-09ac-4d0c-9adc-1f28e3ae2680%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Evan Shelhamer

unread,
Oct 15, 2014, 1:28:59 AM10/15/14
to David Chik, caffe...@googlegroups.com
The delicacy of input preprocessing is a strong motivation for https://github.com/BVLC/caffe/issues/1245

Evan Shelhamer

peng yu

unread,
Oct 15, 2014, 1:37:11 AM10/15/14
to caffe...@googlegroups.com, dr.dav...@googlemail.com, shel...@eecs.berkeley.edu
Thanks Evan, I have double-checked the configuration and disabled the channel swapping.

The result is still so bad that something must have gone wrong.

I really appreciate your help, and it would be nice if you could run my code on your machine to figure out why.

On Wednesday, October 15, 2014 at 1:27:08 PM UTC+8, Evan Shelhamer wrote:

Evan Shelhamer

unread,
Oct 15, 2014, 2:05:12 AM10/15/14
to peng yu, caffe...@googlegroups.com, David Chik
No. The BVLC developers cannot check user code. Good luck.

Evan Shelhamer

peng yu

unread,
Oct 15, 2014, 2:41:50 AM10/15/14
to caffe...@googlegroups.com, yup...@gmail.com, dr.dav...@googlemail.com, shel...@eecs.berkeley.edu
OK.. but still, I have checked the configuration file.

It seems a user can easily set the batch size to exceed the actual data size.

So I was wondering: is there any way to guarantee that the test data blob is exactly the same as the test data in the LMDB or LevelDB?

------------ In the MNIST settings:

I changed the data prefetching thread to output the log line "Restarting data prefetching from start".

In the solver configuration file I changed test_iter to 1, and in the train_test prototxt I set the test batch size to 10000.

The test data size in the database is 10000.

I still get "Restarting data prefetching from start" twice.

So... yeah, is there any way to guarantee I am testing the whole test dataset via the Caffe blob?



On Wednesday, October 15, 2014 at 2:05:12 PM UTC+8, Evan Shelhamer wrote:

David Chik

unread,
Oct 15, 2014, 11:09:10 AM10/15/14
to caffe...@googlegroups.com, yup...@gmail.com, dr.dav...@googlemail.com, shel...@eecs.berkeley.edu
Are you sure, Evan? Even if test_iter x test_batch_size << amount of test data? 

My results show the opposite, perhaps because I have big data: something like 20M training samples and 2M test samples. With 2M test samples, if I set test_iter = 100 and batch_size = 100 and do not shuffle the test data, the test accuracy will be very wrong, fluctuating up and down at each display.

What I can say is Caffe is still very bad at estimating test accuracy, but Theano does a much better job.

Evan Shelhamer

unread,
Oct 15, 2014, 11:48:34 AM10/15/14
to David Chik, caffe...@googlegroups.com, yup...@gmail.com
David: I'm having a hard time seeing your point. If you configure Caffe to not run on your whole test set, then certainly yes, it will not compute the loss / accuracy / output of the whole test set. How is that unreasonable?

If you have a favorite evaluation feature in Theano (do you mean pylearn2?), then a PR or a precise description is welcome.

Over-generalizations from your own usage are not. All of the bundled examples deterministically run over the relevant test/validation set without re-ordering or sampling.



--
Evan Shelhamer

bo chen

unread,
Aug 18, 2015, 12:38:45 AM8/18/15
to Caffe Users
I had the same issue. It seems the problem was in my deploy.prototxt file, where the names of the modified layers did not match the ones modified in train_val.prototxt.

Fixing the error resulted in a huge bump in the confidence scores for the predicted classes.

- Bo

Saman Sarraf

unread,
Apr 18, 2016, 1:49:21 PM4/18/16
to Caffe Users, dr.dav...@googlemail.com, yup...@gmail.com, shel...@eecs.berkeley.edu
Dear Evan, 

If I have 80000 samples and I set test_iter = 800 and batch size = 100, will Caffe cover all the testing data? Or do I need to set batch size = 800 and test_iter = 100? In both cases, is there any possibility that some samples are ignored, for example because of shuffling or something like that?

Thank you,
Saman

Jan

unread,
Apr 19, 2016, 5:24:35 AM4/19/16
to Caffe Users, dr.dav...@googlemail.com, yup...@gmail.com, shel...@eecs.berkeley.edu
Both cases work in theory. But a batch size of 800 needs much more GPU memory than a batch size of 100, so you might run into out-of-memory errors. You can also do batch size = 50 and test_iter = 1600; there is no functional difference. And shuffling is only used if explicitly turned on, which one need not (and should not) do for testing. That should get all of your test set samples processed. There is no mechanism to ignore samples.
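The arithmetic behind this can be sketched directly; the numbers below are the ones from this exchange:

```python
num_test = 80000  # total test samples in Saman's example

# One test pass evaluates exactly test_iter * batch_size samples, in order.
for test_iter, batch_size in [(800, 100), (100, 800), (1600, 50)]:
    covered = test_iter * batch_size
    assert covered == num_test  # each configuration covers the set exactly once

# If test_iter * batch_size < num_test, the tail of the set is never scored;
# if it is larger, the data layer wraps around ("Restarting data prefetching
# from start") and early samples are counted twice.
skipped = num_test - 700 * 100
assert skipped == 10000  # e.g. test_iter = 700 would leave 10000 samples unscored
```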

Jan

Saman Sarraf

unread,
Apr 19, 2016, 2:47:19 PM4/19/16
to Caffe Users, dr.dav...@googlemail.com, yup...@gmail.com, shel...@eecs.berkeley.edu
Thanks Jan. Just to make sure: if I get an averaged test accuracy of around 95% once training is done, and I then use the C++ or Python classification API to predict the same testing samples, I should get something close to 95% (assuming the preprocessing is the same). Is that correct?

Jan

unread,
Apr 21, 2016, 3:35:42 AM4/21/16
to Caffe Users, dr.dav...@googlemail.com, yup...@gmail.com, shel...@eecs.berkeley.edu
Well, you mean if you use the API to feed the same samples and compute the (average) accuracy manually? Then you should indeed get the very same value: 95% in your example, since you're doing essentially the same thing Caffe does.

Given the same preprocessing, of course.

Jan

鄭祐晨

unread,
Nov 22, 2016, 7:44:09 AM11/22/16
to Caffe Users
I had the same problem as you.

It turned out that I forgot to subtract the mean value from the input image at deploy time.
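A tiny NumPy sketch of that deploy-time mean subtraction (the mean value here is a hypothetical stand-in for the contents of mean.binaryproto):

```python
import numpy as np

# Stand-in per-pixel mean; in practice this is loaded from mean.binaryproto.
mean = np.full((3, 32, 32), 120.0)

rng = np.random.RandomState(0)
img = rng.uniform(0, 255, size=(3, 32, 32))  # a fake input image

# The training-time data layer subtracts the mean, so deploy-time inputs
# must too; feeding raw pixels shifts every input and hurts accuracy.
centered = img - mean
assert abs(centered.mean() - (img.mean() - 120.0)) < 1e-9
```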

Px Zhan

unread,
Feb 20, 2017, 8:08:13 AM2/20/17
to Caffe Users
I have the same problem. I wonder how you fixed it? Thanks.

On Tuesday, October 14, 2014 at 3:53:24 PM UTC+8, peng yu wrote:

Saman Sarraf

unread,
Feb 20, 2017, 12:22:02 PM2/20/17
to Caffe Users
Hi there,
You need to make sure of the image resizing and of using the correct image mean.
Those were the issues that I had to solve.
And one more thing: you also need to make sure your deploy file has an identical network architecture to your training prototxt.
Hope it helps,
Saman