I'm trying to calculate the classification error for several pre-trained models.
I start by predicting the top-5 classes. Then, for each predicted class, I perform a backward pass and, using the resulting saliency map, draw a bounding box around the high-intensity pixels.
Next, I crop the input image to the bounding box, resize it, and feed it back into the network to obtain a new set of top-5 predictions. I repeat this for each of the original top-5 classes.
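The saliency-to-crop step I describe can be sketched as follows. This is only a minimal illustration, not my exact code: the quantile threshold and the nearest-neighbor resize are assumptions I made for the sketch, and in practice the resize would use the framework's own preprocessing.

```python
import numpy as np

def saliency_bbox(saliency, quantile=0.95):
    """Bounding box (y0, y1, x0, x1, inclusive) covering pixels whose
    saliency exceeds a quantile threshold (threshold choice is an assumption)."""
    thresh = np.quantile(saliency, quantile)
    ys, xs = np.where(saliency >= thresh)
    return ys.min(), ys.max(), xs.min(), xs.max()

def crop_and_resize(image, bbox, size=224):
    """Crop an (H, W, C) image to bbox and nearest-neighbor resize it to
    (size, size, C) so it can be fed back into the network."""
    y0, y1, x0, x1 = bbox
    crop = image[y0:y1 + 1, x0:x1 + 1]
    h, w = crop.shape[:2]
    rows = (np.arange(size) * h / size).astype(int)
    cols = (np.arange(size) * w / size).astype(int)
    return crop[rows][:, cols]
```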
In total, the 5 initial predictions each yield 5 more, giving 25 predicted classes, which I then rank to produce the final top-5 prediction. This method is my implementation of the paper "Look and Think Twice".
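The final ranking of the 25 candidates can be sketched like this. Note that summing the per-crop scores is my assumption here; the paper may weight or vote over the crops differently.

```python
def rank_topk(score_lists, k=5):
    """Merge per-crop class scores into a final top-k ranking.

    score_lists: one dict per crop, mapping class id -> score.
    Scores for the same class are summed across crops (an assumed
    aggregation rule), then classes are sorted by total score.
    """
    totals = {}
    for scores in score_lists:
        for cls, s in scores.items():
            totals[cls] = totals.get(cls, 0.0) + s
    return sorted(totals, key=totals.get, reverse=True)[:k]
```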
Problem: when I load the pre-trained GoogLeNet model and compute the classification error, the top-1 and top-5 errors come out exactly equal, which is very strange. The issue seems to appear when I use the cropped image as input: when this new image is fed to the network, the top-5 predicted classes are all the same class. I don't know how to solve this. Any ideas?
Thank you!