In the first series of this article we have seen what is computer vision and a brief review of its applications. You can read the first part of this article here. We have also seen the contribution of deep learning in computer vision. Especially we focused on Image Classification and deep learning architecture which is used in Image Classification. In this series we will focus on other applications including Image Localization, Object Detection and Image Segmentation. We will also walk through the required deep learning architecture used for above applications.
Similar to classification, localization finds the location of a single object inside the image. Localization can be used for lots of useful real-life problems. For example, smart cropping (knowing where to crop images based on where the object is located), or even regular object extraction for further processing using different techniques. It can be combined with classification for not only locating the object but categorizing it into one of many possible categories.
A classical dataset for image classification with localization is the PASCAL Visual Object Classes datasets, or PASCAL VOC for short (e.g. VOC 2012). These are datasets used in computer vision challenges over many years.
Iterating over the problem of localization plus classification we end up with the need for detecting and classifying multiple objects at the same time. Object detection is the problem of finding and classifying a variable number of objects on an image. The important difference is the “variable” part. In contrast with problems like classification, the output of object detection is variable in length, since the number of objects detected may change from image to image.
The PASCAL Visual Object Classes datasets, or PASCAL VOC for short (e.g. VOC 2012), is a common dataset for object detection.
There is nothing hardcore about the architectures which we are going to discuss. What we are going to discuss are some clever ideas to make the system intolerant to the number of outputs and to reduce its computation cost. So, we do not know the exact number of objects in our image and we want to classify all of them and draw a bounding box around them. That means that the number of coordinates that the model should output is not constant. If the image has 2 objects, we need 8 coordinates. If it has 4 objects, we want 16. So how we build such a model?
One key idea to traditional computer vision is regions proposal. We generate a set of windows that are likely to contain an object using classic CV algorithms, like edge and shape detection and we apply only these windows (or regions of interests) to the CNN. To learn more about how regions are proposed, we introduce a new architecture called RCNN.
Given an image with multiple objects, we generate some regions of interests using a proposal method (in RCNN’s case this method is called selective search) and wrap the regions into a fixed size. We forward each region to Convolutional Neural Network (such as AlexNet), which will use an SVM to make a classification decision for each one and predicts a regression for each bounding box. This prediction comes as a correction of the region proposed, which may be in the right position but not at the exact size and orientation.
Although the model produces good results, it suffers from a major issue. It is quite slow and computationally expensive. Imagine that in an average case, we produce 2000 regions, which we need to store in disk, and we forward each one of them into the CNN for multiple passes until it is trained. To fix some of these problems, an improvement of the model comes in play called ‘Fast-RCNN’
The idea is straightforward. Instead of passing all regions into the convolutional layer one by one, we pass the entire image once and produce a feature map. Then we take the region proposals as before (using some external method) and sort of project them onto the feature map. Now we have the regions in the feature map instead of the original image and we can forward them in some fully connected layers to output the classification decision and the bounding box correction.
Note that the projection of regions proposal is implemented using a special layer (ROI layer), which is essentially a type of max-pooling with a pool size dependent on the input, so that the output always has the same size.
And we can take this a step further. Using the produced feature maps from the convolutional layer, we infer regions proposal using a Region Proposal network rather than relying on an external system. Once we have those proposals, the remaining procedure is the same as Fast-RCNN (forward to ROI layer, classify using SVM and predict the bounding box). The tricky part is how to train the whole model as we have multiple tasks that need to be addressed:
As the name suggests, Faster RCNN turns out to be much faster than the previous models and is the one preferred in most real-world applications.
Localization and object detection is a super active and interesting area of research due to the high emergency of real world applications that require excellent performance in computer vision tasks (self-driving cars, robotics). Companies and universities come up with new ideas on how to improve the accuracy on regular basis.
There is another class of models for localization and object detection, called single shot detectors, which have become very popular in the last few years because they are even faster and require less computational cost in general. Sure, they are less accurate, but they are ideal for embedded systems and similar power-hungry applications.
Going one step further from object detection we would want to not only find objects inside an image, but find a pixel by pixel mask of each of the detected objects. We refer to this problem as instance or object segmentation.
Semantic Segmentation is the process of assigning a label to every pixel in the image. This is in stark contrast to classification, where a single label is assigned to the entire picture. Semantic segmentation treats multiple objects of the same class as a single entity. On the other hand, instance segmentation treats multiple objects of the same class as distinct individual objects (or instances). Typically, instance segmentation is harder than semantic segmentation.
In order to perform semantic segmentation, a higher level understanding of the image is required. The algorithm should figure out the objects present and also the pixels which correspond to the object. Semantic segmentation is one of the essential tasks for complete scene understanding. This can be used in analysis of medical images and satellite images. Again, the VOC 2012 and MS COCO datasets can be used for object segmentation.
Modern image segmentation techniques are powered by deep learning technology. Here are several deep learning architectures used for segmentation.
Convolutional Neural Networks (CNNs)
Image segmentation with CNN involves feeding segments of an image as input to a convolutional neural network, which labels the pixels. The CNN cannot process the whole image at once. It scans the image, looking at a small “filter” of several pixels each time until it has mapped the entire image. To learn more see our in-depth guide about Convolutional Neural Networks.
Traditional CNNs have fully-connected layers, which can’t manage different input sizes. FCNs use convolutional layers to process varying input sizes and can work faster. The final output layer has a large receptive field and corresponds to the height and width of the image, while the number of channels corresponds to the number of classes. The convolutional layers classify every pixel to determine the context of the image, including the location of objects.
One main motivation for DeepLab is to perform image segmentation while helping control signal decimation—reducing the number of samples and the amount of data that the network must process. Another motivation is to enable multi-scale contextual feature learning—aggregating features from images at different scales. DeepLab uses an ImageNet pre-trained residual neural network (ResNet) for feature extraction. DeepLab uses atrous (dilated) convolutions instead of regular convolutions. The varying dilation rates of each convolution enable the ResNet block to capture multi-scale contextual information. DeepLab comprises three components:
An architecture based on deep encoders and decoders is also known as semantic pixel-wise segmentation. It involves encoding the input image into low dimensions and then recovering it with orientation invariance capabilities in the decoder. This generates a segmented image at the decoder end.
In this post we have discussed some applications of computer vision including Image Localization, Object Detection and Image Segmentation. We then discussed required deep learning architectures which are used for the above applications.