The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million[1][2] images have been hand-annotated by the project to indicate what objects are pictured, and in at least one million of the images, bounding boxes are also provided.[3] ImageNet contains more than 20,000 categories,[2] with a typical category, such as "balloon" or "strawberry", consisting of several hundred images.[4] The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet.[5] Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.[6]
On 30 September 2012, a convolutional neural network (CNN) called AlexNet[7] achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner up. Using convolutional neural networks was feasible due to the use of graphics processing units (GPUs) during training,[7] an essential ingredient of the deep learning revolution. According to The Economist, "Suddenly people started to pay attention, not just within the AI community but across the technology industry as a whole."[4][8][9]
AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006. At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms.[11] In 2007, Li met with Princeton professor Christiane Fellbaum, one of the creators of WordNet, to discuss the project. As a result of this meeting, Li went on to build ImageNet starting from the word database of WordNet and using many of its features.[12]
ImageNet crowdsources its annotation process. Image-level annotations indicate the presence or absence of an object class in an image, such as "there are tigers in this image" or "there are no tigers in this image". Object-level annotations provide a bounding box around the (visible part of the) indicated object. ImageNet uses a variant of the broad WordNet schema to categorize objects, augmented with 120 categories of dog breeds to showcase fine-grained classification.[6] One downside of using WordNet is that its categories may be more "elevated" than would be optimal for ImageNet: "Most people are more interested in Lady Gaga or the iPod Mini than in this rare kind of diplodocus." In 2012, ImageNet was the world's largest academic user of Mechanical Turk. The average worker identified 50 images per minute.[2]
The ILSVRC aims to "follow in the footsteps" of the smaller-scale PASCAL VOC challenge, established in 2005, which contained only about 20,000 images and twenty object classes.[6] To "democratize" ImageNet, Fei-Fei Li proposed a collaboration to the PASCAL VOC team, beginning in 2010, in which research teams would evaluate their algorithms on the given data set and compete to achieve higher accuracy on several visual recognition tasks.[12]
The resulting annual competition is now known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The ILSVRC uses a "trimmed" list of only 1000 image categories or "classes", including 90 of the 120 dog breeds classified by the full ImageNet schema.[6] The 2010s saw dramatic progress in image processing. Around 2011, a good ILSVRC classification top-5 error rate was 25%. In 2012, a deep convolutional neural net called AlexNet achieved 16%; in the next couple of years, top-5 error rates fell to a few percent.[17] While the 2012 breakthrough "combined pieces that were all there before", the dramatic quantitative improvement marked the start of an industry-wide artificial intelligence boom.[4] By 2015, researchers at Microsoft reported that their CNNs exceeded human ability at the narrow ILSVRC tasks.[10][18] However, as one of the challenge's organizers, Olga Russakovsky, pointed out in 2015, the programs only have to identify images as belonging to one of a thousand categories; humans can recognize a larger number of categories, and also (unlike the programs) can judge the context of an image.[19]
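Top-5 error, the metric quoted above, counts a prediction as correct when the true class appears anywhere among the model's five highest-scoring guesses. A minimal sketch of the computation, with illustrative shapes and names:

```python
import torch

def top5_error(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (N, num_classes) class scores; targets: (N,) true class indices."""
    top5 = logits.topk(5, dim=-1).indices              # (N, 5) best guesses
    hit = (top5 == targets.unsqueeze(-1)).any(dim=-1)  # true class in top 5?
    return 1.0 - hit.float().mean().item()

# Example: 4 samples over the 1000 ILSVRC classes.
logits = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
print(top5_error(logits, targets))
```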
By 2014, more than fifty institutions participated in the ILSVRC.[6] In 2017, 29 of 38 competing teams achieved greater than 95% accuracy.[20] That year, ImageNet stated it would roll out a new, much more difficult challenge in 2018 involving the classification of 3D objects using natural language. Because creating 3D data is more costly than annotating a pre-existing 2D image, the dataset is expected to be smaller. The applications of progress in this area would range from robotic navigation to augmented reality.[1]
A study of the history of the multiple layers (taxonomy, object classes and labeling) of ImageNet and WordNet in 2019 described how bias is deeply embedded in most classification approaches for all sorts of images.[21][22][23][24] ImageNet is working to address various sources of bias.[25]
Released in 2021, this family of image classification models is trained on the full ImageNet-21K dataset, a superset of the ImageNet dataset containing more than 21 thousand object classes. Models pretrained on ImageNet-21K and fine-tuned on ImageNet-1K are also available, and they achieve high test accuracy on the ImageNet ILSVRC 2012 benchmark.
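As an illustrative sketch, such a checkpoint can be loaded through the timm library; the exact model name below is an assumption and varies across timm versions:

```python
import timm
import torch

# A ViT pretrained on ImageNet-21K and fine-tuned on ImageNet-1K
# (hypothetical example name; check timm.list_models() for availability).
model = timm.create_model("vit_base_patch16_224.augreg_in21k_ft_in1k",
                          pretrained=True)
model.eval()

# The fine-tuned head outputs scores over the 1000 ILSVRC-2012 classes.
x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```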
Object detection models are usually trained to identify a limited set of object classes, producing checkpoints from which new models can be fine-tuned. For instance, the widely used MS COCO (Microsoft Common Objects in Context) dataset consists of only eighty classes, ranging from people to toothbrushes.
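To make this workflow concrete, here is a minimal sketch using torchvision: a COCO-pretrained detector is loaded and its classification head replaced for a new, hypothetical set of classes:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN checkpoint pretrained on the 80-class MS COCO dataset.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Re-head for a hypothetical custom task: 3 new classes + background.
num_classes = 4
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# From here the model would be fine-tuned on newly annotated images --
# exactly the laborious collection-and-annotation loop described next.
```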
Expanding this repertoire of classes can be challenging: one must go through the laborious process of gathering diverse images for each new object class, annotating them, and fine-tuning an existing model. But what if there were a straightforward way for a model to learn new categories from the start, effortlessly incorporating thousands of additional categories? This is precisely the capability promised by the Detic model, as outlined in the publication by Meta Research.
Detic, introduced by Meta (Facebook) Research and published in January 2022, is an object detection model that can identify 21,000 object classes with strong accuracy, including objects that were previously challenging to detect.
Traditionally, object detection consists of two interconnected tasks: localization, which involves finding the object within an image, and classification, which involves identifying the object category. Existing methods typically combine these tasks and heavily rely on object bounding boxes for all classes.
However, detection datasets are considerably smaller, in both size and number of object classes, than image classification datasets. This discrepancy arises because image-level labels are far cheaper to collect than bounding boxes, so classification datasets grow larger and accumulate richer vocabularies.
By incorporating image-level supervision alongside detection supervision, Detic decouples the localization and classification sub-problems. Consequently, the model becomes capable of detecting and classifying an extensive range of objects with high precision and recall.
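The paper's central device for the image-labeled data is a "max-size" classification loss: the image-level label supervises only the largest region proposal, while boxes from detection datasets keep the usual detection losses. The sketch below illustrates the idea under assumed tensor shapes; it is not Detic's actual implementation (which uses per-class binary losses over an embedding-based classifier):

```python
import torch
import torch.nn.functional as F

def max_size_loss(proposal_boxes: torch.Tensor,
                  proposal_logits: torch.Tensor,
                  image_label: torch.Tensor) -> torch.Tensor:
    """proposal_boxes: (N, 4) as (x1, y1, x2, y2); proposal_logits: (N, C);
    image_label: scalar index of the image-level class."""
    widths = proposal_boxes[:, 2] - proposal_boxes[:, 0]
    heights = proposal_boxes[:, 3] - proposal_boxes[:, 1]
    biggest = torch.argmax(widths * heights)  # index of the largest proposal
    # Only the largest proposal receives the image-level label.
    return F.cross_entropy(proposal_logits[biggest].unsqueeze(0),
                           image_label.unsqueeze(0))

# Example: 100 proposals over 21,000 classes, image labeled with class id 42.
xy = torch.rand(100, 2) * 200
wh = torch.rand(100, 2) * 24 + 1
boxes = torch.cat([xy, xy + wh], dim=-1)  # valid (x1, y1, x2, y2) boxes
loss = max_size_loss(boxes, torch.randn(100, 21000), torch.tensor(42))
```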
Detic is the first known model to train a detector on all 21,000 classes of the renowned ImageNet dataset, making it a versatile and comprehensive foundation model suitable for a wide array of tasks.
Object detection involves identifying both the bounding box and the class, or category, of an object within an input image. Object identification, or classification, by contrast, focuses solely on determining the class name, without considering any bounding box. Traditional object detection models face challenges due to the high cost of annotating bounding boxes: this cost keeps detection datasets small and limits training and detection to a small number of classes.
Object identification, in contrast, requires only per-image labels, which are faster to annotate and therefore allow for the creation of larger datasets. Consequently, it becomes feasible to train models to identify a larger number of classes. However, since such a dataset lacks bounding box information, it cannot be used directly to train object detection.
The Detic model overcomes this predicament by training the object detector on a dataset designed for object identification. This approach, known as Weakly-Supervised Object Detection (WSOD), enables the training of an object detector without relying on bounding box information.
Detic leverages semi-supervised WSOD, using the ImageNet-21K dataset, which is conventionally employed for image classification, to train its detector. Notably, unlike previous studies, Detic does not assign pseudo class labels to the bounding boxes generated by the object detector. Instead, it adopts a different approach.
For each detected bounding box, Detic uses embedding vectors from CLIP, a model trained on an extensive dataset of image-text pairs. CLIP simultaneously trains an image encoder and a text encoder to predict the correct pairings of (image, text) training examples within a batch. At test time, the learned text encoder embeds the names or descriptions of the classes present in the target dataset.
By utilizing embeddings of classes instead of a fixed set of object categories, Detic establishes a classifier capable of recognizing concepts without explicitly encountering any examples of them. This unique approach expands the model's ability to identify objects beyond predefined categories and enhances its adaptability to various scenarios and datasets.
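A sketch of that idea using the Hugging Face transformers CLIP implementation: class names are embedded by the text encoder, and per-box features (random stand-ins here for a detector's projected box features) are classified by cosine similarity, so new classes can be added simply by embedding new names:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["a photo of a tiger", "a photo of a balloon",
               "a photo of a strawberry"]
inputs = tokenizer(class_names, padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)   # (3, 512) for this checkpoint
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Stand-ins for a detector's per-box features, projected into CLIP's space.
box_features = torch.randn(5, 512)
box_features = box_features / box_features.norm(dim=-1, keepdim=True)

# Cosine similarity against the class-name embeddings acts as the classifier.
scores = box_features @ text_emb.T                 # (5, 3)
print(scores.argmax(dim=-1))                       # predicted class per box
```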
The ImageNet-21K dataset used to train Detic is primarily intended for object identification tasks: it provides labels only for entire images rather than for individual objects within them. Despite this limitation, ImageNet-21K offers an extensive set of roughly 21,000 class labels and a vast pool of approximately 14 million images.