Questions and Answers
What is a primary characteristic of the MNIST dataset?
- It includes localization and detection data.
- It contains color images.
- It consists of black and white images of handwritten digits. (correct)
- It is mainly used for classifying clothing items.
What distinguishes Fashion MNIST from the original MNIST dataset?
- Fashion MNIST contains only numerical data.
- Fashion MNIST includes images of clothing items. (correct)
- Fashion MNIST has a higher resolution.
- Fashion MNIST is used for object detection, not classification.
What is the total number of images in the CIFAR dataset?
- 10,000
- 100,000
- 60,000 (correct)
- 50,000
What is the primary focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)?
Which of the following best describes the COCO dataset?
What is the primary task for which the LeNet-5 architecture was designed?
Which characteristic is associated with LeNet-5's architecture?
What differentiates LeNet-5 from more modern CNN architectures in terms of connectivity between feature maps?
What is the main purpose of max pooling in CNNs?
Which of the following is an advantage of using the Tanh activation function over the Sigmoid activation function?
How does the ReLU activation function address the vanishing gradient problem?
What is a key architectural difference that distinguishes AlexNet from LeNet-5?
Which activation function was used in AlexNet to improve training performance compared to earlier models using Tanh or Sigmoid?
What is the purpose of Dropout in the AlexNet architecture?
What is a key architectural characteristic of VGG networks?
What problem do Batch Normalization layers address in deep neural networks?
What is the primary purpose of residual connections in Residual Networks?
What is the key idea behind autoencoders?
What distinguishes undercomplete autoencoders from overcomplete autoencoders?
What is a primary application of autoencoders?
In the context of autoencoders, what does balancing sensitivity and insensitivity refer to?
What is the purpose of skip connections in U-Nets?
For what type of task are U-Nets commonly used?
What is the task of image translation?
What is the primary goal of image inpainting?
In the R-CNN family of object detection models, what is the initial step?
Objectness, Category-independent object proposals, and Selective search are all examples of what?
What is a key difference between R-CNN and Fast R-CNN?
What is the purpose of Intersection over Union (IoU) in object detection?
What is Non-Maximum Suppression (NMS) used for in object detection?
What is a primary drawback of R-CNN?
In Faster R-CNN, what replaces the selective search algorithm used in R-CNN for generating region proposals?
What is the main advantage of Faster R-CNN over Fast R-CNN?
What does 'precision' measure in the context of evaluating object detection models?
What does 'recall' measure in the context of evaluating object detection models?
What are the general characteristics of YOLO?
How does Single Shot Detector (SSD) compare to YOLO in object detection?
In the context of evaluating an object detection system, how are true positives (TP) and false positives (FP) marked?
In the general Batch Normalisation equation, what is the function of the variables γ and β?
Which is the correct Batch Normalisation equation from these options?
What is a key characteristic of the Fashion MNIST dataset that distinguishes it from the original MNIST?
What is the size of images in the CIFAR-10 dataset?
What type of data, other than labeled images, does the ImageNet dataset also contain?
What is the approximate number of object instances in the COCO dataset?
What type of loss function did the LeNet-5 architecture utilize?
What are some of the reasons that LeNet-5 did not have all feature maps fully connected between layers?
What is the primary advantage of using max pooling in neural networks?
What key architectural innovation did AlexNet introduce, enabling more efficient GPU parallelization during training?
What is the primary goal of using dropout during the training phase in AlexNet?
Which of the following is the main advantage of using smaller convolutional filters (e.g., 3x3) in VGG networks?
How does Batch Normalization contribute to training deeper neural networks?
What is the most significant advantage of using residual connections in deep neural networks?
What is the key characteristic of an undercomplete autoencoder?
How do autoencoders achieve a balance between sensitivity and insensitivity?
What is the main purpose of skip connections in U-Nets in the context of image segmentation?
What is the general goal of 'image translation' in the context of CNNs?
What is the use of non-CNN based algorithms in the R-CNN family of object detection models?
What is the main goal of Non-Maximum Suppression (NMS) in object detection tasks?
What is a significant limitation of R-CNN that Faster R-CNN addresses?
In the context of object detection, how is the 'precision' of a model defined?
What is a characteristic feature of the YOLO (You Only Look Once) object detection system?
In the context of evaluating object detection systems, which calculation accurately defines 'recall'?
What is the purpose of optimizing a reconstruction loss function in autoencoders?
What does 'image inpainting' primarily involve?
How does the Single Shot Detector (SSD) approach object detection, especially when compared to YOLO?
What is the relationship between the number of layers and performance in the plain networks described prior to the introduction of residual networks?
What is the significance of the Visual Geometry Group (VGG) at the University of Oxford in the context of CNNs?
What are the two learnable parameters which Batch Normalization utilizes?
Which of the following is the correct function for Batch Normalization? (x is input batch, µ is batch mean, σ² is batch variance, γ and β are learnable parameters)
What is the scaled hyperbolic tangent activation function as used in LeNet-5?
What is a common purpose of the Leaky ReLU activation function?
What is the number of parameters in AlexNet?
Flashcards
What is the MNIST dataset?
A dataset for handwritten digit recognition, derived from a modified NIST database, containing 60,000 training images and 10,000 testing images.
What is Fashion MNIST?
A dataset similar to MNIST that contains images of clothing items instead of digits. It serves as an alternative because MNIST is considered overused and too easy.
What is the CIFAR Dataset?
A dataset from the Canadian Institute for Advanced Research, containing 60,000 images in total (50,000 training images and 10,000 testing images). Images are 32x32 pixels and are split into either 10 or 100 classes.
What is the ImageNet Dataset?
What is the COCO dataset?
What is LeNet-5?
What is Max Pooling?
What is the Tanh nonlinearity?
What is ReLU?
What is AlexNet?
What is Dropout?
What is VGG?
What is Batch Normalisation?
What are Residual Networks?
What are Autoencoders?
What are U-Nets?
What is Image Translation?
What is the R-CNN family?
What is Non-Maximum Suppression?
What is Intersection over Union (IoU)?
What is Fast R-CNN?
What is YOLO?
Study Notes
- The CNN Architectures topic covers Datasets, LeNet-5, AlexNet, VGG, Residual Networks, Autoencoders / U-Net, Detectors, and Information Theory.
MNIST Dataset
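- MNIST stands for the Modified National Institute of Standards and Technology database.
- It contains 60,000 training images and 10,000 testing images.
- Images are single-channel (black and white) pictures of handwritten digits; pixel intensities are typically normalised to the range 0 to 1.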
Fashion MNIST
- Similar to MNIST but contains images of clothing items.
- It includes ten classes such as dress, coat, and shirt.
- MNIST is considered overused and "too easy".
CIFAR Dataset
- Maintained by the Canadian Institute for Advanced Research.
- It consists of 60,000 images in total, with classes such as airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
- The 60,000 images are divided into 50,000 training and 10,000 testing images.
- Image size is 32 x 32 x 3.
- CIFAR-10 is split into 10 classes.
- CIFAR-100 is split into 100 classes.
ImageNet Dataset
- Contains 15 million labelled high-resolution images.
- Includes localization, detection data, and video.
- The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a yearly contest.
- CNNs have dominated the top positions since 2012.
COCO Dataset
- Tasks include Detection, Captions and Keypoints.
- It consists of 330,000 images with 80 object categories.
- Has 1.5 million object instances.
LeNet-5
- Used for handwritten digit (0-9) classification (MNIST).
- It could be extended to classify handwritten characters (a-z).
- Such classifiers can be used to automatically digitize documents, letters, and books.
- Developed in 1998.
- It was not the first solution (alternatives included K-NN, PCA + quadratic classifier, and SVM), but it was the first to reach a 0.7% error rate.
- The LeNet-5 architecture includes an input layer of size 32x32, followed by convolutional layers, subsampling layers, and fully connected layers.
- It comprises two convolutional layers, two fully connected hidden layers, and one fully connected output layer, and is trained with an MSE loss.
- Max pooling is used to downsample twice.
- Output has 10 dimensions, one for each digit.
- It has approximately 60,000 parameters.
- Uses a scaled hyperbolic tangent as the activation function: g(x) = A tanh(Sx).
- Not all feature maps are connected between layers; this sparse connectivity lowers the computational cost and breaks symmetry.
- Modern implementations usually ignore this and connect all feature maps, mainly for ease of implementation.
Max Pooling
- Max Pooling is used to reduce resolution, leading to less memory usage and faster processing.
- It propagates the most strongly "activated" features and discards weaker, noisier ones.
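A minimal sketch of the downsampling described above, assuming a 2x2 window with stride 2 on a single-channel feature map stored as a nested list. The function name `max_pool_2x2` and the example values are illustrative, not taken from any library.

```python
def max_pool_2x2(image):
    """Downsample a 2D list of numbers with 2x2 max pooling (stride 2).

    Assumes the height and width are both even.
    """
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            # Keep only the strongest activation in each 2x2 window.
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4x4 input becomes a 2x2 output, so later layers process a quarter of the values.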
Tanh Nonlinearity
- The sigmoid function maps inputs to the range (0, 1): s(x) = 1 / (1 + e^(-x)).
- The hyperbolic tangent function maps inputs to the range (-1, 1): tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
- Tanh is a scaled and shifted sigmoid: tanh(x) = 2s(2x) - 1.
- The hyperbolic tangent is slightly preferred over the sigmoid because it is symmetric around 0.
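The relationship between the two functions can be checked numerically. The sketch below uses only the standard library. The constants `A = 1.7159` and `S = 2/3` in `scaled_tanh` are the values commonly cited for LeNet-5; they are not stated in these notes and should be treated as an assumption.

```python
import math


def sigmoid(x):
    """Logistic sigmoid s(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """Tanh written as a scaled and shifted sigmoid: 2*s(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def scaled_tanh(x, A=1.7159, S=2.0 / 3.0):
    """Scaled hyperbolic tangent g(x) = A * tanh(S * x).

    A and S are assumed values, not taken from these notes.
    """
    return A * math.tanh(S * x)


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    # The two expressions agree up to floating-point rounding.
    assert abs(tanh_from_sigmoid(x) - math.tanh(x)) < 1e-12
```

With the assumed constants, `scaled_tanh(1.0)` is very close to 1 and `scaled_tanh(-1.0)` is very close to -1.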
ReLU
- ReLU (Rectified Linear Unit) is defined as ReLU(x) = max(0, x).
- It mitigates vanishing gradients, because its gradient is 1 for all positive inputs.
- Leaky ReLU is defined as LReLU(x) = x if x ≥ 0, and ax if x < 0.
- ELU (Exponential Linear Unit) is defined as ELU(x) = x if x ≥ 0, and a(e^x - 1) if x < 0.
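The three definitions above translate directly into code. This is a scalar sketch; the default values `a=0.01` for Leaky ReLU and `a=1.0` for ELU are common choices assumed here, not values given in the notes.

```python
import math


def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, a=0.01):
    """LReLU(x) = x if x >= 0, else a*x (a is an assumed default)."""
    return x if x >= 0 else a * x


def elu(x, a=1.0):
    """ELU(x) = x if x >= 0, else a*(e^x - 1) (a is an assumed default)."""
    return x if x >= 0 else a * (math.exp(x) - 1.0)


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), elu(x))
```

For negative inputs ReLU outputs exactly zero, while Leaky ReLU and ELU keep a small non-zero response, so some gradient still flows.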
AlexNet
- It is a much larger network than LeNet-5.
- Has 60 million parameters.
- Uses two "parallel" networks for more efficient GPU parallelization.
- It consists of 5 convolutional layers and 3 fully connected layers.
- Training took 5 to 6 days on two GTX 580 GPUs.
- Used the ImageNet training set, comprising 1.2 million color images across 1000 classes.
- Inputs are sized at 224 x 224 x 3.
- The top-1 error rate was 37.5%, and the top-5 error rate was 17.0%.
- AlexNet uses the ReLU nonlinearity (activation function).
- Visualizations of the first-layer filters show smooth, well-converged filters that resemble traditional ones: gradients, checkerboards, and blurs.
Dropout
- Used in AlexNet to prevent overfitting and greatly improve test set performance.
- Randomly "drops" connections during training.
- Uses all connections at inference time (val. and test).
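The train/inference behaviour above can be sketched as follows. This is an illustration under stated assumptions, not AlexNet's exact procedure: it zeroes activations rather than individual connections, and it uses "inverted" dropout (rescaling the surviving values during training), whereas the original AlexNet scaled outputs at test time instead. The names `dropout`, `p_drop`, and `rng` are hypothetical.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations (assumed variant).

    Training: each value is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop).
    Inference: the input is returned unchanged.
    """
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [1.0] * 8
print(dropout(hidden, rng=random.Random(0)))  # mix of 0.0 and 2.0
print(dropout(hidden, training=False))        # unchanged at inference
```

Rescaling during training keeps the expected activation the same in both modes, so no extra scaling is needed at validation or test time.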
ILSVRC Performance
- Describes the classification errors and average precision in object detection over the years on the ILSVRC competition.
VGG
- Developed by Visual Geometry Group, University of Oxford.
- VGG is a deeper architecture than AlexNet.
- Has 16 - 19 layers.
- Was a top entry in the 2014 ImageNet Challenge (winner of the localization task and runner-up in classification).
- It uses smaller convolutional filters (3x3).
- The learned representations generalize well.
- Has multiple configurations, ranging from 133 to 144 million parameters.
- The architecture incorporates ReLU activation and uses SGD with momentum.
- Dropout is used to avoid overfitting, and Glorot initialization is used for weight initialization.
- Weight decay (L2 regularization) is also applied.
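One way to see why small 3x3 filters are attractive is to compare a stack of 3x3 convolutions with a single larger filter covering the same receptive field. The arithmetic below is a sketch that assumes stride 1, equal input and output channel counts, and no bias terms. The helper names are hypothetical.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


def conv_weights(kernel, channels, layers=1):
    """Weight count, assuming `channels` in and out, without biases."""
    return layers * kernel * kernel * channels * channels


channels = 64
# Three stacked 3x3 layers see the same 7x7 region as one 7x7 layer...
print(stacked_receptive_field(3, 3))     # 7
# ...but need fewer weights and apply three nonlinearities instead of one.
print(conv_weights(3, channels, 3))      # 110592
print(conv_weights(7, channels))         # 200704
```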
Deep Neural Networks
- They are challenging to train because the distribution of inputs coming from prior layers changes after each weight update.
- Techniques that help include Residual Networks and Batch normalisation.
Batch Normalisation
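- Motivated by the significance of network depth, which raises the question of whether learning better networks is as easy as stacking more layers.
- An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence.
- Batch normalisation allows SGD to converge and works well for networks with tens of layers.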
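The standard batch normalisation transform is y = γ (x - µ) / sqrt(σ² + ε) + β, where µ and σ² are the mini-batch mean and variance and γ, β are learnable scale and shift parameters. The sketch below applies it to a single feature across a mini-batch; the small constant `eps` (ε) is an assumed numerical-stability term that these notes do not mention.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch (training-mode sketch).

    batch: list of floats, one value per example.
    gamma, beta: learnable scale and shift.
    eps: assumed small constant to avoid division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalised = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalised)  # values centred on 0 with variance close to 1
```

With `gamma=1` and `beta=0` the output has roughly zero mean and unit variance; the learnable parameters let the network undo or adjust that normalisation when useful.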
Residual Networks
- Introduce "residual" connections: l_res = x + F(x).
- Won the 2015 ILSVRC (with an ensemble).
- Easier to train.
- Achieves better propagation of gradients.
- Allow much deeper networks, with configurations such as 152 layers for ImageNet or up to 1000 layers for CIFAR-10.
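The residual connection x + F(x) can be sketched with plain lists. Here `F` stands for whatever stack of layers the block wraps; the lambdas in the example are placeholders, not real layers.

```python
def residual_block(x, F):
    """Return x + F(x) element-wise.

    x: list of floats. F: function mapping a list to a same-length list.
    """
    fx = F(x)
    if len(fx) != len(x):
        raise ValueError("F must preserve the shape of x")
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, 2.0]
# If F outputs zeros, the block reduces to the identity mapping.
print(residual_block(x, lambda v: [0.0] * len(v)))   # [1.0, 2.0]
print(residual_block(x, lambda v: [2 * t for t in v]))  # [3.0, 6.0]
```

Because the input is added back unchanged, the block only has to learn a correction to the identity, and gradients have a direct path through the addition.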
Autoencoders
- Map input to a different space and back.
- Undercomplete autoencoders: the encoded representation has fewer dimensions than the input; used for dimensionality reduction and compression.
- Overcomplete autoencoders: the encoded representation has the same number of dimensions as the input (or more); used for feature learning and denoising images / signals.
- The encoding is y = f(x) and the reconstruction is x̂ = g(y) = g(f(x)).
- The ideal autoencoder model balances sensitivity to inputs for accurate reconstruction with insensitivity to avoid memorization/overfitting of the training data.
- Optimise with a reconstruction loss L(x, x̂), plus a regulariser to discourage memorisation.
- Models can learn key data attributes like the most important features.
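A minimal sketch of the encode/decode/loss pipeline, assuming a linear undercomplete autoencoder (4 inputs, 2 latent dimensions), mean squared error as the reconstruction loss, and an L2 weight penalty as the regulariser. The weights, the penalty strength `1e-3`, and all function names are illustrative assumptions; no training is performed.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def encode(x, W_enc):
    return matvec(W_enc, x)      # y = f(x)


def decode(y, W_dec):
    return matvec(W_dec, y)      # x_hat = g(y)


def reconstruction_loss(x, x_hat):
    """Mean squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def l2_penalty(*matrices):
    """Sum of squared weights, used here as the regulariser."""
    return sum(w * w for W in matrices for row in W for w in row)


# Hand-picked weights: keep the first two input dimensions, drop the rest.
W_enc = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
W_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0],
         [0.0, 0.0]]

x = [0.5, -1.0, 0.0, 2.0]
y = encode(x, W_enc)             # 2-D code
x_hat = decode(y, W_dec)         # 4-D reconstruction
loss = reconstruction_loss(x, x_hat) + 1e-3 * l2_penalty(W_enc, W_dec)
print(y, x_hat, loss)
```

The last input dimension is lost in the 2-D code, which is what the reconstruction loss penalises; training would adjust the weights to keep whichever directions matter most across the data.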
U-Nets
- U-Nets are autoencoders with skip connections.
- Encoder feature maps are concatenated with the corresponding decoder feature maps.
- As a result, local detail is better preserved.
- Initially employed for biomedical image segmentation.
- Extensively adaptable to various imaging problems.
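The skip-connection concatenation can be sketched as joining two sets of feature maps along the channel axis. The representation below (a list of channels, each a 2D list) and the function name are assumptions made for illustration.

```python
def concat_channels(encoder_features, decoder_features):
    """Concatenate two feature-map stacks along the channel axis.

    Each argument is a list of channels; each channel is a 2D list (H x W).
    Both stacks must share the same spatial size.
    """
    enc_shape = (len(encoder_features[0]), len(encoder_features[0][0]))
    dec_shape = (len(decoder_features[0]), len(decoder_features[0][0]))
    if enc_shape != dec_shape:
        raise ValueError("spatial sizes must match before concatenation")
    return encoder_features + decoder_features


encoder = [[[1, 2], [3, 4]]]                       # 1 channel, 2x2
decoder = [[[0, 0], [0, 1]], [[5, 5], [5, 5]]]     # 2 channels, 2x2
merged = concat_channels(encoder, decoder)
print(len(merged))  # 3 channels
```

The layers after the concatenation see both the fine-grained encoder features and the upsampled decoder features at the same resolution.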
R-CNN Family
- Tasks include object detection, which involves localizing objects with bounding boxes (or segments) and classifying them.
- Used on datasets such as PASCAL VOC, MS COCO, Cityscapes and KITTI.
R-CNN
- First finds Regions of Interest (RoIs): naively "scanning" a Full HD image with a CNN could take up to 2,073,600 evaluations (one per pixel position).
- These regions are then classified using CNN features and an SVM; this method is R-CNN.
- Region proposal is not based on CNNs. It uses algorithms such as objectness, selective search, and category-independent object proposals.
- It uses a CNN to extract image features.
- It finds possible objects, and then scales them to a consistent aspect ratio.
- A CNN is then used to classify the objects.
IoU: Intersection over Union
- This helps to compare bounding boxes.
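- IoU indicates how much the predicted bounding box overlaps the ground truth bounding box: the area of their intersection divided by the area of their union.
- A result of 1 means a perfect prediction; 0 means no overlap at all.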
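A short sketch of the IoU computation, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates. The box format and function name are assumptions, since the notes do not fix a representation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0, identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0, no overlap
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, partial overlap
```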
NMS: Non-Maximum Suppression
- Used when the predicted regions often overlap.
- R-CNN predicts classifications for 2000 RoIs.
- Discards the boxes with low classification confidence.
- Greedily selects boxes depending on the score assigned.
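The two steps above (drop low-confidence boxes, then greedily keep the highest-scoring box and suppress overlapping ones) can be sketched as follows. The box format `(x1, y1, x2, y2)`, the threshold values, and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    # Step 1: discard boxes with low classification confidence.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit boxes from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is suppressed because it overlaps the higher-scoring box 0 (IoU of about 0.68), and box 3 is removed by the confidence threshold.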
Problems with R-CNN
- R-CNN takes a long time to train, requiring thousands of iterations.
- The selective search algorithm is fixed (it is not learned), and the overall pipeline takes an excessive amount of time to train.
- It cannot be run in real time.
- Testing takes around 47 seconds per image.
Fast R-CNN
- It takes as input an entire image together with a set of object proposals.
- For each object proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
- Training is end to end, as the whole model is differentiable.
- Training is faster than with R-CNN (8.75 hours vs 84 hours), and so is inference (0.32 s vs 47 s per image).
- Improved precision.
Faster R-CNN
- Introduces the Region Proposal Network.
- Replaces Region of Interest (RoI) proposal algorithms with a CNN.
- It is a trainable proposal method that uses a binary classifier to predict whether an object is present or not.
Evaluation
- Set two thresholds: a confidence threshold (how confident the model is that it has correctly classified an object or section) and an IoU (Intersection over Union) threshold (how well the object is localized).
- Models are then compared in terms of precision and recall at these thresholds.
- Precision = Good Predictions / All Predictions
- Recall = Good Predictions / All Targets
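The two ratios can be computed directly from counts. In this sketch a "good prediction" is assumed to be one that passes both the confidence and IoU thresholds; the function name and the example counts are made up for illustration.

```python
def precision_recall(good_predictions, all_predictions, all_targets):
    """Return (precision, recall) from raw counts.

    good_predictions: predictions that pass both thresholds (assumed meaning).
    all_predictions: every box the model output.
    all_targets: every ground-truth object.
    """
    precision = good_predictions / all_predictions if all_predictions else 0.0
    recall = good_predictions / all_targets if all_targets else 0.0
    return precision, recall


# Hypothetical counts: 10 predicted boxes, 8 of them good, 16 true objects.
precision, recall = precision_recall(8, 10, 16)
print(precision, recall)  # 0.8 0.5
```

Here most predictions are right (high precision), but half of the objects are missed (lower recall).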
YOLO
- Stands for You Only Look Once.
- Is lightweight and very fast.
- Has lower accuracy.
Other CNN Detectors
- Include the Single Shot Detector (SSD), which is similar to YOLO.
Description
Explore CNN architectures, datasets like MNIST, Fashion MNIST, CIFAR & ImageNet. Discusses LeNet-5, AlexNet, VGG, Residual Networks, Autoencoders/U-Net, Detectors, and Information Theory. Understand the structure and applications of each dataset.