I have come across a few journal papers (on machine-learning classification problems) that evaluate accuracy with a Top-N approach. The data showed Top-1 accuracy = 42.5% and Top-5 accuracy = 72.5% under the same training and testing conditions. I wonder how to calculate these Top-1 and Top-5 percentages?
In ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al., every solution based on a single CNN (page 7) has no top-5 error rate, while the ones with 5 and 7 CNNs do (and the error rate for 7 CNNs is better than for 5 CNNs).
For example, suppose the top-1 class is "mouse" and the top-2 classes are mouse and dog. If the correct class was "dog", it is counted as correct for the top-2 accuracy, but as wrong for the top-1 accuracy.
Let's say you're applying a machine learning algorithm for object recognition using a neural network. A picture of a cat is shown, and your neural network outputs a probability score for each possible class.
The complement of the accuracy is the error:
The top-1 error: the percentage of the time that the classifier did not give the correct class the highest probability score.
The top-5 error: the percentage of the time that the classifier did not include the correct class among its top 5 probabilities or guesses.
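To make the calculation concrete, here is a minimal Python sketch for a single image; the class names and scores are made up for illustration and are not from any actual model:

```python
import numpy as np

# Hypothetical scores for the cat image (illustrative values only).
classes = np.array(["tiger", "dog", "cat", "lynx", "mouse", "fox"])
scores = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

true_label = "cat"
ranked = classes[np.argsort(scores)[::-1]]  # classes sorted best score first

top1_correct = ranked[0] == true_label   # False: "tiger" ranks first
top5_correct = true_label in ranked[:5]  # True: "cat" ranks third
```

Averaging these checks over a whole test set gives the top-1 and top-5 accuracy; the corresponding errors are one minus those values.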
Recently, while working on a computer vision project involving a multi-class classification problem for activity detection, where the task was to classify a person's activity (sitting, singing, writing, eating, etc.) using a complex CNN model, the accuracy was not up to the mark, but the predictions seemed to be mostly correct.
In many multiclass classification problems in the NLP and computer vision space, top-N accuracy (top-3, top-5, and so on) is a better measure of model performance. For example, in a recommender system it makes more sense to show the customer the top 3 or 5 recommended products rather than just the single top recommended product. Top-N accuracy is the standard way of measuring model performance in all the ImageNet competitions.
If we consider the usual top-1 accuracy, the predictions are correct 2/5 times, so the resulting accuracy is 40%. If we consider the top-2 predictions, the true labels appear among the top-2 predicted values 3/5 times, hence the top-2 accuracy is 60%.
Scikit-learn does not provide a function to calculate top-N accuracy the way it does other usual classification metrics like recall, precision, F1 score, etc. The Keras and PyTorch frameworks provide functions to calculate top-N accuracy. However, in certain situations you might need to write custom code to calculate it. The following Python code explains the concept, with actual code that you can use directly:
Step 3: Finally, you need logic to calculate the mean top-N accuracy based on the number of successes in finding the actual label among the top-N predicted labels. I have created the following function, to which you can pass the actual labels (A), predicted probabilities (C), and n (3, 5, etc.) to get the top-N accuracy.
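The original snippet is not reproduced here, so below is a minimal sketch of such a function, assuming A is an array of integer labels and C is a (samples x classes) array of predicted probabilities whose column index equals the class id; the example data at the end is invented so as to reproduce the 40%/60% figures above:

```python
import numpy as np

def top_n_accuracy(A, C, n):
    """Mean top-n accuracy over a batch of predictions.

    A: (num_samples,) integer class labels.
    C: (num_samples, num_classes) predicted probabilities,
       with column index equal to the class id.
    n: number of highest-probability classes to consider.
    """
    A = np.asarray(A)
    C = np.asarray(C)
    top_n = np.argsort(C, axis=1)[:, -n:]            # n best classes per sample
    successes = np.any(top_n == A[:, None], axis=1)  # true label among them?
    return successes.mean()

# Invented example: top-1 accuracy is 2/5 = 40%, top-2 accuracy is 3/5 = 60%.
A = [0, 1, 2, 0, 1]
C = [[0.60, 0.30, 0.10],
     [0.50, 0.40, 0.10],
     [0.40, 0.35, 0.25],
     [0.20, 0.50, 0.30],
     [0.10, 0.70, 0.20]]
print(top_n_accuracy(A, C, 1))  # 0.4
print(top_n_accuracy(A, C, 2))  # 0.6
```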
In the paper ImageNet Large Scale Visual Recognition Challenge by Russakovsky et al. (2014), there is a section in which they estimate the human classification error for ILSVRC (Section 6.4.1, Quantitative comparison of human and computer accuracy on large-scale image classification).
P.S.: It is not an easy task for a human to classify an image when there are a thousand possible classes. In addition, there is considerable overlap between classes for some images. Try it for yourself here.
However, on ResNet-101 the accuracy loss is larger. I tried quantizing the bias term (the rhs constant of add) using 16 bits (32 bits would make the lhs overflow after the left shift) and obtained good accuracy. Specifically, in add_rewrite I marked the rhs constant as QAnnotateKind.BIAS and modified those lines accordingly.
My patch changes the scale selected in AddRealize. Previously, the lhs scale was selected in almost all cases, because the lhs scale comes from conv2d and is the product of the input and weight scales (smaller than the rhs scale), so the bias would be shifted left in that case.
After my patch, the bias scale is selected and the conv2d result is shifted left instead.
By the way, have you tried your pass on the v2 versions of ResNet in the MXNet model zoo? I see catastrophic accuracy drops with the current pass on those models, and I wonder whether it is due to problems with the bias.
By the way, I noticed that #2723 added a skip callback (_module.py#L183) which breaks CSE (it does not crunch down matching int32 casts) for my per-channel quantization pass. Is there another use case that requires this callback?
Right now I think the current fskip implementation is too brittle and has downstream consequences. In my quantization use case, skipping cast(i32) causes every identity branch to be exhaustively recomputed, because CSE stops at that step.
data is usually the output of a previous conv2d. There are duplicated simulated_quantize ops, and the add that follows in both branches converts the int8 to int32. So the simulated_quantize + add in both branches will be translated to right_shift + cast(i8) + cast(i32).
We use stop_fusion to ensure that the previous conv2d result is cast to int8 before being saved to global memory.
So the issue is that I think we have somewhat different use cases :). I am prototyping per-channel quantization on CPU, where the compute-to-bandwidth ratio is lower, so the difference is probably not as apparent. However, in my situation, preventing the casts from being removed also explodes even ResNet-18 to over 3000 intermediate values, which is far worse than the bandwidth overhead. I wonder if modifying the annotate pass to treat the adds differently here would work.
Convolutional neural networks (CNNs) are commonly developed at a fixed resource cost, and then scaled up in order to achieve better accuracy when more resources are made available. For example, ResNet can be scaled up from ResNet-18 to ResNet-200 by increasing the number of layers, and recently, GPipe achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline CNN by a factor of four. The conventional practice for model scaling is to arbitrarily increase the CNN depth or width, or to use larger input image resolution for training and evaluation. While these methods do improve accuracy, they usually require tedious manual tuning, and still often yield suboptimal performance. What if, instead, we could find a more principled method to scale up a CNN to obtain better accuracy and efficiency?
The first step in the compound scaling method is to perform a grid search to find the relationship between different scaling dimensions of the baseline network under a fixed resource constraint (e.g., 2x more FLOPS). This determines the appropriate scaling coefficient for each of the dimensions mentioned above. We then apply those coefficients to scale up the baseline network to the desired target model size or computational budget.
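As a rough illustration, here is a minimal sketch of the compound scaling rule described in the EfficientNet paper; the constants alpha, beta, and gamma are the values reported there, while the helper function itself is an assumption for illustration, not part of the released code:

```python
# Compound scaling sketch: a single coefficient phi scales depth, width,
# and resolution together. The paper's grid search found constants with
# alpha * beta**2 * gamma**2 ~= 2, so each +1 in phi roughly doubles FLOPS.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported in the paper

def compound_scale(phi, base_depth, base_width, base_resolution):
    """Return scaled (depth, width, resolution) multipliers for coefficient phi."""
    return (base_depth * ALPHA ** phi,
            base_width * BETA ** phi,
            base_resolution * GAMMA ** phi)

# Example: one scaling step up from the baseline (phi = 1) costs ~2x FLOPS.
print(compound_scale(1, base_depth=1.0, base_width=1.0, base_resolution=224))
# -> (1.2, 1.1, 257.6); the resolution would be rounded in practice.
```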
The effectiveness of model scaling also relies heavily on the baseline network. So, to further improve performance, we have also developed a new baseline network by performing a neural architecture search using the AutoML MNAS framework, which optimizes both accuracy and efficiency (FLOPS). The resulting architecture uses mobile inverted bottleneck convolution (MBConv), similar to MobileNetV2 and MnasNet, but is slightly larger due to an increased FLOP budget. We then scale up the baseline network to obtain a family of models, called EfficientNets.
We have compared our EfficientNets with other existing CNNs on ImageNet. In general, the EfficientNet models achieve both higher accuracy and better efficiency over existing CNNs, reducing parameter size and FLOPS by an order of magnitude. For example, in the high-accuracy regime, our EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on CPU inference than the previous GPipe. Compared with the widely used ResNet-50, our EfficientNet-B4 uses similar FLOPS, while improving the top-1 accuracy from 76.3% of ResNet-50 to 82.6% (+6.3%).
Though EfficientNets perform well on ImageNet, to be most useful, they should also transfer to other datasets. To evaluate this, we tested EfficientNets on eight widely used transfer learning datasets. EfficientNets achieved state-of-the-art accuracy in 5 out of the 8 datasets, such as CIFAR-100 (91.7%) and Flowers (98.8%), with an order of magnitude fewer parameters (up to 21x parameter reduction), suggesting that our EfficientNets also transfer well.
By providing significant improvements to model efficiency, we expect EfficientNets could potentially serve as a new foundation for future computer vision tasks. Therefore, we have open-sourced all EfficientNet models, which we hope can benefit the larger machine learning community. You can find the EfficientNet source code and TPU training scripts here.
Despite these dramatic improvements, we still had questions about the time-to-accuracy metric. Can competitors cherry-pick the best training run? Do models optimized for time-to-accuracy generalize well? How important is the accuracy threshold? Can optimizing models for time-to-accuracy teach us anything?