Computer Vision Model PDF
Summary
This document discusses computer vision models for object detection, including R-CNN, Fast R-CNN, Faster R-CNN, and YOLO, and highlights the different types of deep neural networks (DNNs) they use.
Full Transcript
Computer vision model
Official (Open)

Types of Models
Different computer vision models help us answer questions about an image:
1. What objects are in the image?
2. Where are those objects in the image?
3. Where are the key points on an object?
4. Which pixels belong to each object?
Different types of deep neural networks (DNNs) can be customized for an application to solve these problems.
NOTE: In general, a computer vision model's output consists of a label and a confidence or score, which is some estimate of the likelihood that the object is labelled correctly. This definition is intentionally vague, as ‘confidence’ means very different things for different types of models.

CV Models
1. R-CNN
2. Fast R-CNN
3. Faster R-CNN
4. YOLO (different versions)

Region-Based Convolutional Neural Network (R-CNN)
1. This method bypasses the problem of selecting a huge number of regions.
2. It was proposed by Ross Girshick et al.
3. A selective search algorithm extracts just 2000 regions from the image, called region proposals.
4. Each region proposal is warped into a square and fed into a convolutional neural network that produces a 4096-dimensional feature vector as output.
5. The CNN acts as a feature extractor. Features are variables or attributes in the data set.
6. The output dense layer consists of the features extracted from the image. The extracted features are fed into a support vector machine (SVM) to classify the object of interest, and its location is refined with predicted bounding-box offsets.

Problems with R-CNN
1. It still takes a huge amount of time to train the network, as you would have to classify 2000 region proposals per image.
2. It cannot be used in real time, as it takes around 47 seconds for each test image.
3. The selective search algorithm is a fixed algorithm, so no learning happens at that stage. This could lead to the generation of bad candidate region proposals.

Fast R-CNN
1. 
The approach is similar to the R-CNN algorithm.
2. Instead of feeding the region proposals to the CNN, the whole input image is fed to the CNN to generate a convolutional feature map.
3. Region proposals are identified from the feature map and warped into squares.
4. A region of interest (RoI) pooling layer reshapes them to a fixed size so that they can be fed into a fully connected layer.
5. A softmax layer predicts the class of the proposed region, along with the offset values for the bounding box.
Fast R-CNN is faster than R-CNN because you do not have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image, and a feature map is generated from it.

Comparison of object detection algorithms between R-CNN and Fast R-CNN

Faster R-CNN
Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network.
Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates the selective search algorithm and lets the network learn the region proposals.
Faster R-CNN also generates a convolutional feature map from the image. Instead of running selective search on that feature map, a separate network is used to predict the region proposals.
The predicted region proposals are then reshaped using an RoI pooling layer, whose output is used to classify the image within the proposed region and predict the offset values for the bounding boxes.

Comparison of object detection algorithms between R-CNN, Fast R-CNN and Faster R-CNN

YOLO: You Only Look Once
1. All of the previous object detection algorithms use regions to localize the object within the image.
2. The network does not look at the complete image.
3. Instead, it looks at parts of the image that have a high probability of containing the object.
4. 
YOLO, or You Only Look Once, is an object detection algorithm that is very different from the region-based algorithms seen above.
5. In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for these boxes.

How does YOLO work?
1. An image is split into an S×S grid, and within each grid cell we take m bounding boxes.
2. For each bounding box, the network outputs a class probability and offset values for the bounding box.
3. The bounding boxes whose class probability is above a threshold value are selected and used to locate the object within the image.
4. YOLO is orders of magnitude faster (45 frames per second) than other object detection algorithms.
5. A limitation of the YOLO algorithm is that it struggles with small objects within the image; for example, it might have difficulty detecting a flock of birds. This is due to the spatial constraints of the algorithm.

One-stage vs two-stage detectors

Single-shot object detection
1. Single-shot object detection uses a single pass of the input image to make predictions about the presence and location of objects in the image. Because it processes the entire image in a single pass, it is computationally efficient.
2. However, single-shot object detection is generally less accurate than other methods, and it is less effective at detecting small objects. Such algorithms can be used to detect objects in real time in resource-constrained environments.
3. YOLO is a single-shot detector that uses a fully convolutional neural network (CNN) to process an image. We will dive deeper into the YOLO model in the next section.

Two-shot object detection
1. Two-shot object detection uses two passes of the input image to make predictions about the presence and location of objects.
The first pass generates a set of proposals, or potential object locations, and the second pass refines these proposals and makes the final predictions. This approach is more accurate than single-shot object detection but is also more computationally expensive.
2. Overall, the choice between single-shot and two-shot object detection depends on the specific requirements and constraints of the application.
3. Generally, single-shot object detection is better suited to real-time applications, while two-shot object detection is better for applications where accuracy is more important.

What is YOLO?
It is an end-to-end neural network that predicts bounding boxes and class probabilities all at once.
While algorithms like Faster R-CNN work by detecting possible regions of interest using the Region Proposal Network and then performing recognition on those regions separately, YOLO makes all of its predictions in a single forward pass through one network.
Methods that use Region Proposal Networks perform multiple iterations for the same image, while YOLO gets away with a single iteration.

YOLO's timeline

How does YOLO work? YOLO Architecture

YOLO v2
1. It is faster, more accurate, and able to detect a wider range of object classes than YOLO.
2. It uses a CNN backbone called Darknet-19, a variant of the VGGNet architecture with simple progressive convolution and pooling layers.
3. It uses anchor boxes: a set of predefined bounding boxes that are combined with predicted offsets to determine the final bounding box. This allows a wider range of object sizes and aspect ratios to be handled.
4. It uses batch normalization, which helps to improve the accuracy and stability of the model.
5. It uses a multi-scale training strategy, in which the input image resolution is changed periodically during training so that the model learns to predict at multiple scales. This helps to improve the detection performance on small objects.
6. It uses a new loss function.
It is based on the sum of the squared errors between the predicted and ground-truth bounding boxes and class probabilities.

YOLO v2 comparison

YOLO v3
1. The aim is to increase the accuracy and speed of the algorithm.
2. It uses a CNN architecture called Darknet-53, which borrows residual connections from the ResNet architecture and is designed specifically for object detection tasks. It has 53 convolutional layers and is able to achieve state-of-the-art results on various object detection benchmarks.
3. Another improvement in YOLO v3 is the use of anchor boxes at different scales. In YOLO v2, the anchor boxes were all applied at a single detection scale, which limited the ability of the algorithm to detect objects of very different sizes and shapes. In YOLO v3 the anchor boxes are assigned to three detection scales, and their sizes and aspect ratios are varied to better match the size and shape of the objects being detected.
4. YOLO v3 also brings the concept of "feature pyramid networks" (FPN) into YOLO. FPNs are a CNN architecture used to detect objects at multiple scales. They construct a pyramid of feature maps, with each level of the pyramid being used to detect objects at a different scale. This helps to improve the detection performance on small objects, as the model is able to see the objects at multiple scales.
5. In addition to these improvements, YOLO v3 can handle a wider range of object sizes and aspect ratios. It is also more accurate and stable than the previous versions of YOLO.

YOLO v3 comparison

YOLO v4
The primary improvement in YOLO v4 over YOLO v3 is the use of a new CNN backbone built on CSPNet (shown below). CSPNet stands for "Cross Stage Partial Network"; YOLO v4 applies it to Darknet-53 to form the CSPDarknet53 backbone. It has a relatively shallow structure, with only 54 convolutional layers. However, it can achieve state-of-the-art results on various object detection benchmarks.
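Returning to the anchor boxes described for YOLO v2 and YOLO v3 above ("predefined bounding boxes and the predicted offsets"), the following is a minimal sketch of how one predicted offset vector could be turned into a final box. It assumes the parameterization published for YOLO v2/v3 (a sigmoid on the centre offsets, an exponential on the size offsets). The function and argument names are hypothetical and are not taken from the slides or from any library.

```python
import math


def sigmoid(x):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(offsets, anchor, cell, grid_size):
    """Combine predicted offsets with an anchor box (illustrative sketch).

    offsets   : (tx, ty, tw, th) raw network outputs for one box
    anchor    : (pw, ph) anchor width and height, in grid-cell units (assumption)
    cell      : (cx, cy) column and row index of the grid cell
    grid_size : S, the number of cells along each side of the S x S grid
    Returns (x_center, y_center, width, height) as fractions of the image size.
    """
    tx, ty, tw, th = offsets
    pw, ph = anchor
    cx, cy = cell
    # The sigmoid keeps the predicted centre inside its own grid cell.
    x_center = (sigmoid(tx) + cx) / grid_size
    y_center = (sigmoid(ty) + cy) / grid_size
    # The exponential rescales the anchor; zero offsets leave it unchanged.
    width = pw * math.exp(tw) / grid_size
    height = ph * math.exp(th) / grid_size
    return x_center, y_center, width, height
```

With all offsets at zero, the decoded box sits at the centre of its grid cell and has exactly the anchor's size, which is why well-chosen anchors give the network an easier starting point.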
YOLO v4
Both YOLO v3 and YOLO v4 use anchor boxes with different scales and aspect ratios to better match the size and shape of the detected objects. YOLO v4 generates its anchor boxes with k-means clustering: a clustering algorithm groups the ground-truth bounding boxes into clusters, and the centroids of the clusters are used as the anchor boxes. This allows the anchor boxes to be more closely aligned with the size and shape of the detected objects.
While both YOLO v3 and YOLO v4 use a similar loss function for training the model, YOLO v4 introduces a new term called "GHM loss.” It is a variant of the focal loss function and is designed to improve the model’s performance on imbalanced datasets.
YOLO v4 also improves the architecture of the FPNs used in YOLO v3.

YOLO v4 comparison

YOLO v5
YOLO v5 was introduced in 2020 as an open-source project by Ultralytics, which also maintains it; it was not developed by the team behind the original YOLO algorithm. YOLO v5 builds upon the success of previous versions and adds several new features and improvements.
Unlike YOLO, YOLO v5 uses a more complex architecture (architecture shown below) that combines a CSP-based backbone with a feature-aggregation neck. Using a more complex architecture allows YOLO v5 to achieve higher accuracy and better generalization to a wider range of object categories.
1. Another difference between YOLO and YOLO v5 is the training data used to learn the object detection model. YOLO was trained on the PASCAL VOC dataset, which consists of 20 object categories. The released YOLO v5 models, on the other hand, were trained on the larger and more diverse COCO dataset, which includes 80 object categories.
2. YOLO v5 uses a new method for generating the anchor boxes, called "dynamic anchor boxes."
It involves using a clustering algorithm to group the ground-truth bounding boxes into clusters and then using the centroids of the clusters as the anchor boxes. This allows the anchor boxes to be more closely aligned with the size and shape of the detected objects.
3. YOLO v5 also uses "spatial pyramid pooling" (SPP), a type of pooling layer that pools the feature maps at multiple scales. SPP is used to improve the detection performance on small objects, as it allows the model to see the objects at multiple scales. YOLO v4 also uses SPP, but YOLO v5 includes several improvements to the SPP architecture that allow it to achieve better results.
4. YOLO v4 and YOLO v5 use a similar loss function to train the model. YOLO v5 includes a term called "CIoU loss," a variant of the IoU loss function that also penalizes the distance between box centers and the mismatch in aspect ratio.

YOLO v5 comparison

YOLO v6
YOLO v6 was proposed in 2022 by Li et al. as an improvement over previous versions. One of the main differences between YOLO v5 and YOLO v6 is the CNN architecture used. YOLO v6 uses a re-parameterizable backbone called EfficientRep, which is designed for higher computational efficiency on hardware than the architecture used in YOLO v5. It can achieve state-of-the-art results on various object detection benchmarks. The framework of the YOLO v6 model is shown below.
YOLO v6 also changes how boxes are predicted: it adopts an anchor-free detection head instead of predefined anchor boxes.

YOLO v6 comparison

YOLO v7
YOLO v7 uses nine anchor boxes, which allows it to detect a wider range of object shapes and sizes than previous versions, thus helping to reduce the number of false positives.
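The anchor-generation idea described above (group the ground-truth boxes with a clustering algorithm, then use the cluster centroids as anchors) can be sketched in plain Python. This is a simplified illustration, not the code of any YOLO release: the function names are hypothetical, boxes are reduced to (width, height) pairs with positive sizes, and similarity is measured by IoU, which is an assumption borrowed from how anchor clustering is commonly described.

```python
import random


def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50, seed=0):
    """Cluster a list of (width, height) boxes into k anchors (illustrative sketch)."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k randomly chosen boxes
    for _ in range(iterations):
        # Assignment step: each box joins the anchor it overlaps most.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        # Update step: each anchor becomes the mean width/height of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(b[0] for b in members) / len(members)
                mean_h = sum(b[1] for b in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break  # converged
        anchors = new_anchors
    # Sort by area so the smallest anchor comes first.
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

Calling `kmeans_anchors(boxes, 9)` on the (width, height) pairs of a labelled dataset would yield nine anchors. Like any k-means run, the result depends on the random starting boxes, so the seed is fixed here only to make the sketch repeatable.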
A key improvement in YOLO v7 is the use of a new loss function called “focal loss.” Previous versions of YOLO used a standard cross-entropy loss function, which is known to be less effective at detecting small objects. Focal loss addresses this issue by down-weighting the loss for well-classified examples and focusing on the hard examples, that is, the objects that are hard to detect.
YOLO v7 also has a higher resolution than the previous versions. It processes images at a resolution of 608 by 608 pixels, which is higher than the 416 by 416 resolution used in YOLO v3. This higher resolution allows YOLO v7 to detect smaller objects and to achieve higher accuracy overall.
One of the main advantages of YOLO v7 is its speed. It can process images at a rate of 155 frames per second, much faster than other state-of-the-art object detection algorithms. By comparison, the original baseline YOLO model was capable of processing at a maximum rate of 45 frames per second. This makes YOLO v7 suitable for sensitive real-time applications such as surveillance and self-driving cars, where higher processing speeds are crucial.

YOLO v7 comparison

Limitations of YOLO v7
1. YOLO v7, like many object detection algorithms, struggles to detect small objects. It might fail to accurately detect objects in crowded scenes or when objects are far away from the camera.
2. YOLO v7 is also not perfect at detecting objects at different scales. This can make it difficult to detect objects that are either very large or very small compared to the other objects in the scene.
3. YOLO v7 can be sensitive to changes in lighting or other environmental conditions, so it may be inconvenient to use in real-world applications where lighting conditions vary.
4. YOLO v7 can be computationally intensive, which can make it difficult to run in real time on resource-constrained devices such as smartphones or other edge devices.
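The focal loss idea described above for YOLO v7 (down-weight well-classified examples so that training focuses on the hard ones) can be illustrated for a single binary prediction. This is a minimal sketch of the general focal loss formula, -alpha_t * (1 - p_t)^gamma * log(p_t). It is not taken from any YOLO code base, and the default values gamma=2.0 and alpha=0.25 are commonly quoted settings used here as assumptions.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction.

    p : predicted probability of the positive class
    y : ground-truth label, 0 or 1
    """
    p_t = p if y == 1 else 1.0 - p      # probability given to the true class
    return -math.log(max(p_t, eps))     # eps avoids log(0)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction (illustrative sketch)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t) ** gamma is close to 0 for confident, correct predictions,
    # so easy examples contribute very little to the total loss.
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, y, eps)
```

For an easy positive example (p = 0.9) the modulating factor is about 0.01, while for a hard one (p = 0.1) it is about 0.81, so the hard example dominates the loss far more than it would under plain cross-entropy. Setting gamma to 0 removes the modulation and leaves a class-weighted cross-entropy.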
Code
The ultralytics package is required:
!pip install ultralytics
from ultralytics import YOLO
Then load the YOLO model to use it:
# Load a model
model = YOLO('yolov8n.pt')
Note to tutor: Please explain how to use the model function and YOLO in coding.

The End
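As a follow-up to the Code slide, here is a hedged sketch of one way the loaded model could be used. It relies on the ultralytics Python interface as commonly documented (`model.predict`, and per-image results exposing `boxes.xyxy`, `boxes.conf`, `boxes.cls` and `names`). The helper names `to_records` and `detect`, the confidence threshold of 0.25 and the image path in the usage note are illustrative assumptions, not part of the slides.

```python
def to_records(xyxy, conf, cls, names):
    """Pair each box with its label and confidence.

    xyxy  : list of [x1, y1, x2, y2] pixel coordinates
    conf  : list of confidence scores, one per box
    cls   : list of class indices, one per box
    names : dict mapping class index to class name
    """
    return [
        {
            "label": names[int(c)],
            "confidence": float(p),
            "box_xyxy": [float(v) for v in box],
        }
        for box, p, c in zip(xyxy, conf, cls)
    ]


def detect(image_path, weights="yolov8n.pt", min_confidence=0.25):
    """Run a pretrained YOLO model on one image and return plain-Python records."""
    # Imported here so that to_records can be used without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)  # loads the weights file (downloaded if not present)
    results = model.predict(image_path, conf=min_confidence)
    records = []
    for result in results:  # one result object per input image
        boxes = result.boxes
        records.extend(
            to_records(
                boxes.xyxy.tolist(),
                boxes.conf.tolist(),
                boxes.cls.tolist(),
                result.names,
            )
        )
    return records
```

A call such as `detect("street.jpg")`, where the file name is hypothetical, would return one dictionary per detected object. Each dictionary holds the label and confidence score mentioned in the note at the start of the deck, together with the bounding box corners.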