The same version of Caffe with the same model, but different results on different CPUs


maay

Apr 18, 2018, 6:24:11 AM
to Caffe Users
As the title says, I installed the same version of Caffe and ran the same model, but got different results on different computers with different CPUs. Does someone know the reason, and how to solve it? Thanks.

Przemek D

Apr 18, 2018, 6:43:47 AM
to Caffe Users
How different are the results? What CPUs are we talking about? What OSes? What compilers? What model? What data?
It is hard to say what the cause is without any information; you have to narrow it down so we can help you find it.

maay

Apr 18, 2018, 10:54:52 PM
to Caffe Users
Thank you for the reply.

Here is a description of the two different computers that I use.

CPUs: "Intel® Core i7-7700"  VS.  "Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz"

GPUs: both GPU GeForce GTX 1080

OSes: both Ubuntu 14.04

Compilers: both Caffe 1.0.0-rc5 (GPU mode with cuda-8.0, cudnn-v5.05) + python 2 inference code 

Model: 
I use a model of my own design, roughly (Convolution-PReLU-BatchNorm-Scale) x 14, followed by a Convolution and a Pooling layer, which finally gives one output value. The training labels are 0, 1, 2, 3, 4, and the model's output value ranges roughly from -0.x to 5.x.

How different are the results: 
The model's input is an image of about 500 x 500 pixels.
I tested the same set of images as input, and the model produced different output values on the two computers.
Over 50 images, the difference is between -0.21 and +0.24.

What is the possible reason? Are there any methods to resolve it?

On Wednesday, April 18, 2018 at 6:43:47 PM UTC+8, Przemek D wrote:

Przemek D

Apr 19, 2018, 4:44:35 AM
to Caffe Users
Please clarify one more thing: you take the same model (prototxt) and the same dataset, train it on each machine in parallel, and then test? Because in this case the results are expected to differ. Even on a single machine, two independent runs of the same training will yield two slightly different models with different test results (due to randomness in stuff like transformations, dropout etc.).
It would be different if you took the same trained model (prototxt+caffemodel) and only tested it on two machines - those results should be generally the same (unless you explicitly use randomness, like oversampling).
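
For illustration, here is a minimal sketch (the file names are placeholders) of testing the exact same trained model on both machines while feeding an identical, pre-saved input array, so that preprocessing is taken out of the comparison:

import caffe
import numpy as np

caffe.set_mode_gpu()
net = caffe.Net("deploy.prototxt", "trained.caffemodel", caffe.TEST)

# Load the exact same preprocessed array on both machines, e.g. one saved
# once with np.save and copied over; shape (1, C, H, W) is assumed here.
blob = np.load("preprocessed_input.npy")
net.blobs["data"].reshape(*blob.shape)
net.blobs["data"].data[...] = blob
out = net.forward()

If the outputs still differ with a byte-identical input array and weight file, the difference happens inside the forward pass itself.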

Xun Victor

Apr 19, 2018, 7:07:10 AM
to Caffe Users
Hi,
There is little chance that the actual problem comes from the hardware being different.

Have you tried running multiple times on the SAME computer to see if even there the results change from run to run?
Have you tried fixing the caffe random seed?
caffe.set_random_seed(42)
Are you sure the input data is shuffled the same way? If you use batch norm with use_global_stats: false, the batch norm layers will output different results.
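
For example, a quick sketch (model paths and the image name are placeholders) to check run-to-run reproducibility on a single machine with a fixed seed:

import caffe
import numpy as np

caffe.set_random_seed(42)   # fix Caffe's RNG (mainly relevant if anything random is used)
caffe.set_mode_gpu()

net = caffe.Classifier("deploy.prototxt", "trained.caffemodel")
image = caffe.io.load_image("test.jpg")

runs = [net.predict([image], oversample=True) for _ in range(3)]
# If these maximum differences are non-zero on one machine, some source
# of randomness is involved rather than the hardware.
print [np.abs(runs[0] - r).max() for r in runs[1:]]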

maay

Apr 20, 2018, 2:15:49 AM
to Caffe Users
The model was trained just once.
Yes, I took the same trained model (prototxt+caffemodel) and tested it on two different machines.

I use python code to do Caffe inference.
The following code is the key part of inference that I use:
import caffe
import numpy as np

net = caffe.Classifier(deploy, model,
                       mean=np.load("Mean.npy").mean(1).mean(1),
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(512, 512))
image = caffe.io.load_image(filename)
output_prob = net.predict([image], oversample=True)

This is the definition of the predict function and the Classifier class:
The definition of the oversample function is here:


I use oversample=True, so the test image is cropped and mirrored in a fixed way, producing 10 specific regions of the test image. The 10 outputs for these regions are then averaged as the final result.

The oversampling here is not random, because testing multiple times on one computer gives the same result.
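
For reference, a small sketch (the file name and crop size are assumptions) showing that caffe.io.oversample is deterministic: it always takes the four corner crops plus the center crop, and their mirrored versions, giving 10 fixed crops:

import caffe

image = caffe.io.load_image("test.jpg")              # H x W x 3, floats in [0, 1]
resized = caffe.io.resize_image(image, (512, 512))   # match image_dims
crops = caffe.io.oversample([resized], (448, 448))   # crop_dims = the net's input size (assumed here)
print crops.shape                                    # (10, 448, 448, 3): the same crops every run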

On Thursday, April 19, 2018 at 4:44:35 PM UTC+8, Przemek D wrote:

maay

Apr 20, 2018, 2:34:08 AM
to Caffe Users
Hi,
  1. I tried running multiple times on the SAME computer; the results are the same.
  2. Is caffe.set_random_seed used only in the training phase? I only trained the model once.
  3. My BatchNorm layer in deploy.prototxt is written as below:
layer {
  name: "bn_conv1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1_BN"
  batch_norm_param {
    use_global_stats: true
  }
  include {
    phase: TEST
  }
}
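
To rule out a difference in the loaded weights themselves, here is a small sketch (the layer name is taken from the prototxt above) that prints the stored BatchNorm statistics; they should be identical on both machines if the same caffemodel is used:

import numpy as np

# For a Caffe BatchNorm layer, the params are [mean, variance, moving-average factor];
# with use_global_stats: true the effective statistics are mean/factor and variance/factor.
params = net.params["bn_conv1"]
mean_blob, var_blob, factor_blob = params[0], params[1], params[2]
scale = 1.0 / factor_blob.data[0] if factor_blob.data[0] != 0 else 0.0
print "global mean (first 5):", (mean_blob.data * scale)[:5]
print "global var  (first 5):", (var_blob.data * scale)[:5]
# If these differ between the two machines, the weight files differ;
# if they match, the discrepancy appears during the forward pass.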

On Thursday, April 19, 2018 at 7:07:10 PM UTC+8, Xun Victor wrote:

Xun Victor

Apr 20, 2018, 3:51:27 AM
to Caffe Users
Hi maay,

Your prototxt seems correct.
If I were you, I would try printing the values of the net blobs at each layer after feeding the same image on both computers.
# First get the list of blob names (after you have forwarded an image):
blob_names = net.blobs.keys()
# Then loop over the blobs and print their values
for blob_name in blob_names:
    print net.blobs[blob_name].data
    # (you can set a pause after each print)
    import ipdb; ipdb.set_trace()

See where the results begin to differ between the two computers. Especially check net.blobs['data'].data (which corresponds to the representation of the image at the beginning of the net).
If the input data is already different, the difference may come from external libraries (the function used to read the image, etc.). If all the blobs are the same, the problem comes from some post-processing you perform.
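
A possible way to do this comparison (dump_blobs and compare_dumps are just hypothetical helper names, and the tolerance is arbitrary) is to save every blob to a file on each machine and then diff the two files offline:

import numpy as np

def dump_blobs(net, path):
    # Copy every blob into a plain numpy array and save them all to one .npz file.
    blobs = dict((name, blob.data.copy()) for name, blob in net.blobs.items())
    np.savez(path, **blobs)

def compare_dumps(path_a, path_b, atol=1e-5):
    # Print the maximum absolute difference per blob between the two dumps.
    a, b = np.load(path_a), np.load(path_b)
    for name in a.files:
        diff = np.abs(a[name] - b[name]).max()
        print name, diff, ("OK" if diff <= atol else "DIFFERS")

# On each machine, after net.predict(...) or net.forward():
#   dump_blobs(net, "blobs_machine_A.npz")
# Then copy both files to one machine and run:
#   compare_dumps("blobs_machine_A.npz", "blobs_machine_B.npz")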