CNN Architectures: Datasets, LeNet, AlexNet, VGG

Questions and Answers

What is a primary characteristic of the MNIST dataset?

  • It includes localization and detection data.
  • It contains color images.
  • It consists of black and white images of handwritten digits. (correct)
  • It is mainly used for classifying clothing items.

What distinguishes Fashion MNIST from the original MNIST dataset?

  • Fashion MNIST contains only numerical data.
  • Fashion MNIST includes images of clothing items. (correct)
  • Fashion MNIST has a higher resolution.
  • Fashion MNIST is used for object detection, not classification.

What is the total number of images in the CIFAR dataset?

  • 10,000
  • 100,000
  • 60,000 (correct)
  • 50,000

What is the primary focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)?

  • Advancing image classification and object detection techniques. (correct)

Which of the following best describes the COCO dataset?

  • A dataset for object detection, keypoints, and captions. (correct)

What is the primary task for which the LeNet-5 architecture was designed?

  • Handwritten digit classification. (correct)

Which characteristic is associated with LeNet-5's architecture?

  • Application of max pooling for downsampling. (correct)

What differentiates LeNet-5 from more modern CNN architectures in terms of connectivity between feature maps?

  • LeNet-5 uses sparse connectivity, selectively connecting feature maps to encourage learning diverse features. (correct)

What is the main purpose of max pooling in CNNs?

  • To reduce the computational load and extract dominant features. (correct)

Which of the following is an advantage of using the Tanh activation function over the Sigmoid activation function?

  • Tanh is symmetric around zero, which may lead to faster convergence. (correct)

How does the ReLU activation function address the vanishing gradient problem?

  • By outputting the input directly if it is positive, preventing saturation. (correct)

What is a key architectural difference that distinguishes AlexNet from LeNet-5?

  • AlexNet involves 'parallel' networks and more parameters. (correct)

Which activation function was used in AlexNet to improve training performance compared to earlier models using Tanh or Sigmoid?

  • ReLU (correct)

What is the purpose of Dropout in the AlexNet architecture?

  • To prevent overfitting by randomly dropping connections during training. (correct)

What is a key architectural characteristic of VGG networks?

  • The utilization of smaller convolutional filters (3x3) in deeper networks. (correct)

What problem do Batch Normalization layers address in deep neural networks?

  • Vanishing/exploding gradients, improving convergence during training. (correct)

What is the primary purpose of residual connections in Residual Networks?

  • To allow the network to learn an identity function, aiding the training of very deep networks. (correct)

What is the key idea behind autoencoders?

  • To compress input data into a lower-dimensional space and then reconstruct it. (correct)

What distinguishes undercomplete autoencoders from overcomplete autoencoders?

  • Undercomplete autoencoders have fewer dimensions in the encoding than the input, encouraging learning of salient features; overcomplete autoencoders have the same or more dimensions. (correct)

What is a primary application of autoencoders?

  • Dimensionality reduction and denoising of images or signals. (correct)

In the context of autoencoders, what does balancing sensitivity and insensitivity refer to?

  • Balancing the trade-off between reconstruction accuracy and overfitting. (correct)

What is the purpose of skip connections in U-Nets?

  • To concatenate feature maps from the encoding path to the decoding path, preserving local detail. (correct)

For what type of task are U-Nets commonly used?

  • Biomedical image segmentation. (correct)

What is the task of image translation?

  • Converting an image from one domain to another (e.g., edges to photo). (correct)

What is the primary goal of image inpainting?

  • To restore missing or damaged parts of an image. (correct)

In the R-CNN family of object detection models, what is the initial step?

  • Generating region proposals. (correct)

Objectness, Category-independent object proposals, and Selective search are all examples of what?

  • Region proposal algorithms. (correct)

What is a key difference between R-CNN and Fast R-CNN?

  • R-CNN extracts features for each region proposal independently, while Fast R-CNN feeds the entire image into a CNN once and then extracts features. (correct)

What is the purpose of Intersection over Union (IoU) in object detection?

  • To measure the overlap between the predicted bounding box and the ground truth bounding box. (correct)

What is Non-Maximum Suppression (NMS) used for in object detection?

  • To remove redundant, overlapping bounding boxes predicting the same object. (correct)

What is a primary drawback of R-CNN?

  • Its high computational cost and slow processing speed. (correct)

In Faster R-CNN, what replaces the selective search algorithm used in R-CNN for generating region proposals?

  • A convolutional neural network (CNN). (correct)

What is the main advantage of Faster R-CNN over Fast R-CNN?

  • Faster R-CNN integrates region proposal directly into the network, rather than depending upon other external algorithms. (correct)

What does 'precision' measure in the context of evaluating object detection models?

  • The proportion of detected objects that are actually correct. (correct)

What does 'recall' measure in the context of evaluating object detection models?

  • The proportion of actual objects that are correctly detected by the model. (correct)

What are the general characteristics of YOLO?

  • Fast computation, lower accuracy. (correct)

How does Single Shot Detector (SSD) compare to YOLO in object detection?

  • SSD predicts object detections at multiple scales, while YOLO only predicts at a single scale. (correct)

In the context of evaluating an object detection system, how are true positives (TP) and false positives (FP) marked?

  • Detections where IoU > thresh2 (correct)

In the general Batch Normalisation equation, what is the function of the variables γ and β?

  • They are learnable parameters that scale and shift the normalised data. (correct)

Which is the correct Batch Normalisation equation from these options?

  • $\gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$ (correct)

What is a key characteristic of the Fashion MNIST dataset that distinguishes it from the original MNIST?

  • It contains images of clothing items across 10 different classes. (correct)

What is the size of images in the CIFAR-10 dataset?

  • 32 x 32 x 3 (correct)

What type of data, other than labeled images, does the ImageNet dataset also contain?

  • Localization and detection data and short video clips. (correct)

What is the approximate number of object instances in the COCO dataset?

  • 1.5 million (correct)

What type of loss function did the LeNet-5 architecture utilize?

  • MSE loss (correct)

What are some of the reasons that LeNet-5 did not have all feature maps fully connected between layers?

  • Larger computational cost and breaking of symmetry. (correct)

What is the primary advantage of using max pooling in neural networks?

  • Reducing resolution while preserving activated features and decreasing computational cost. (correct)

What key architectural innovation did AlexNet introduce, enabling more efficient GPU parallelization during training?

  • Two 'parallel' networks (correct)

What is the primary goal of using dropout during the training phase in AlexNet?

  • To prevent overfitting by randomly dropping connections. (correct)

Which of the following is the main advantage of using smaller convolutional filters (e.g., 3x3) in VGG networks?

  • To reduce the number of parameters while maintaining a high representational capacity. (correct)

How does Batch Normalization contribute to training deeper neural networks?

  • By allowing higher learning rates due to more stable gradients. (correct)

What is the most significant advantage of using residual connections in deep neural networks?

  • Enabling the training of much deeper networks by alleviating the vanishing gradient problem. (correct)

What is the key characteristic of an undercomplete autoencoder?

  • The encoding has fewer dimensions than the input, forcing it to learn a compressed representation. (correct)

How do autoencoders achieve a balance between sensitivity and insensitivity?

  • By optimizing a reconstruction loss while regularising to prevent memorisation or overfitting. (correct)

What is the main purpose of skip connections in U-Nets in the context of image segmentation?

  • To combine feature maps to recover local detail lost during downsampling. (correct)

What is the general goal of 'image translation' in the context of CNNs?

  • To convert an image from one representation to another (e.g., labels to a street scene). (correct)

What is the use of non-CNN based algorithms in the R-CNN family of object detection models?

  • To generate region proposals. (correct)

What is the main goal of Non-Maximum Suppression (NMS) in object detection tasks?

  • To discard redundant, overlapping bounding boxes and retain the most accurate ones. (correct)

What is a significant limitation of R-CNN that Faster R-CNN addresses?

  • The reliance on fixed, non-trainable region proposal algorithms. (correct)

In the context of object detection, how is the 'precision' of a model defined?

  • The proportion of true positive detections among all detections made by the model. (correct)

What is a characteristic feature of the YOLO (You Only Look Once) object detection system?

  • It is lightweight, fast, and has lower accuracy. (correct)

In the context of evaluating object detection systems, which calculation accurately defines 'recall'?

  • True Positives / (True Positives + False Negatives) (correct)

What is the purpose of optimizing a reconstruction loss function in autoencoders?

  • To make the reconstructed output as close as possible to the original input. (correct)

What does 'image inpainting' primarily involve?

  • Filling in missing or damaged regions of an image. (correct)

How does the Single Shot Detector (SSD) approach object detection, especially when compared to YOLO?

  • SSD is a one-stage detector, similar to YOLO. (correct)

What is the relationship between the number of layers and performance in the plain networks described prior to the introduction of residual networks?

  • Deeper plain networks saturate in performance and then degrade with more layers. (correct)

What is the significance of the Visual Geometry Group (VGG) at the University of Oxford in the context of CNNs?

  • They developed VGG networks, which won the 2014 ImageNet Challenge. (correct)

What are the two learnable parameters which Batch Normalization utilizes?

  • γ (gamma), β (beta) (correct)

Which of the following is the correct function for Batch Normalization? (x is the input batch, µ is the batch mean, σ² is the batch variance, γ and β are learnable parameters)

  • $\gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (correct)

What is the scaled hyperbolic tangent activation function as used in LeNet-5?

  • $g(x) = A \tanh(S x)$ (correct)

What is a common purpose of the Leaky ReLU activation function?

  • To reduce the impact of dying ReLU (some neurons never activate). (correct)

What is the number of parameters in AlexNet?

  • 60 million parameters (correct)

Flashcards

What is the MNIST dataset?

A modified version of the NIST database for handwritten digit recognition, containing 60,000 training images and 10,000 test images.

What is Fashion MNIST?

A dataset similar to MNIST that contains images of clothing items instead of digits. It is offered as an alternative because MNIST is considered overused and too easy.

What is the CIFAR Dataset?

A dataset from the Canadian Institute for Advanced Research containing 60,000 images in total (50,000 training and 10,000 test images). Images are 32x32 pixels and are split into either 10 classes (CIFAR-10) or 100 classes (CIFAR-100).

What is the ImageNet Dataset?

A dataset with 15 million labeled high-resolution images that has localization and detection data and video, and is used in the yearly ILSVRC competition.

What is the COCO dataset?

A dataset for object detection, keypoints, and captions, containing 330,000 images, 80 object categories, and 1.5 million object instances.

What is LeNet-5?

A CNN architecture that classifies handwritten digits (0-9) and can be used to automatically digitize documents, letters, and books. It achieved a 0.7% error rate.

What is Max Pooling?

Reducing the resolution of a feature map by keeping only the largest value within each window of a given size.

What is the Tanh nonlinearity?

A hyperbolic tangent function, centered around zero with larger gradients that slightly improve training over sigmoid.

What is ReLU?

A type of activation function which outputs x if x is positive, otherwise zero. Faster to compute than sigmoid.

What is AlexNet?

CNN Architecture much larger than LeNet-5, having 60 million parameters, leveraging two parallel networks, 5 convolutional layers, and 3 fully connected layers.

What is Dropout?

Prevents overfitting and greatly improves test set performance by randomly dropping connections during training.

What is VGG?

CNN Architecture deeper than AlexNet with 16-19 layers, uses smaller convolutional filters (3x3), and has multiple configurations presented.

What is Batch Normalisation?

A technique, with a regularizing effect, that normalizes the inputs of each layer over a mini-batch so that they keep a stable distribution, which improves the training of deep networks.

What are Residual Networks?

A neural network architecture in which connections are added from the input of a layer to the output of a later layer. It mitigates the vanishing/exploding gradient problem observed in deep networks.

What are Autoencoders?

A type of neural network architecture that maps its input to a different space and back again, using an encoding and a decoding component. Can be used for dimensionality reduction and compression.

What are U-Nets?

A type of autoencoder, used for biomedical image segmentation, with skip connections that carry finer details across to the decoder. This helps preserve local detail.

What is Image Translation?

The task of transforming an image from one domain to another, like converting a black and white photo to color or an aerial image to a map.

What is the R-CNN family?

A family of object detection architectures to locate and classify items in an image by localizing the objects in bounding boxes or segmentations.

What is Non-Maximum Suppression?

An approach used to eliminate multiple bounding boxes around the same object by keeping the box with the highest classification score.

What is Intersection over Union (IoU)?

A way to measure how well a detection is localized, defined as the area of intersection of two bounding boxes divided by the area of their union.

What is Fast R-CNN?

An object detection method that takes an entire image and a set of object proposals as input and extracts a fixed-length feature vector for each proposal from the feature map.

What is YOLO?

Real-time object detection that applies a single neural network to the full image, partitioning the image into regions and predicting bounding boxes and probabilities for each region.

Study Notes

  • The CNN Architectures topic covers Datasets, LeNet-5, AlexNet, VGG, Residual Networks, Autoencoders / U-Net, Detectors, and Information Theory.

MNIST Dataset

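  • MNIST stands for the Modified National Institute of Standards and Technology database.
  • It contains 60,000 training images and 10,000 testing images.
  • Images are 28 x 28 single-channel (grayscale) images of handwritten digits, commonly described as black and white; pixel values are often scaled to the range 0 to 1.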

Fashion MNIST

  • Similar to MNIST but contains images of clothing items.
  • It includes ten classes such as dress, coat, and shirt.
  • It is offered as an alternative because MNIST is considered overused and "too easy".

CIFAR Dataset

  • Maintained by the Canadian Institute for Advanced Research.
  • It consists of 60,000 images in total; the CIFAR-10 classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
  • The 60,000 images are divided into 50,000 training and 10,000 testing images.
  • Image size is 32 x 32 x 3.
  • CIFAR-10 has 10 classes.
  • CIFAR-100 has 100 classes.

ImageNet Dataset

  • Contains 15 million labelled high-resolution images.
  • Includes localization and detection data, and video.
  • The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a yearly contest.
  • CNNs have dominated the top positions since 2012.

COCO Dataset

  • Tasks include Detection, Captions and Keypoints.
  • It consists of 330,000 images with 80 object categories.
  • Has 1.5 million object instances.

LeNet-5

  • Used for handwritten digit (0-9) classification (MNIST).
  • It could be extended to classify handwritten characters (a-z).
  • Can automatically digitize documents, letters, and books.
  • Developed in 1998.
  • It was not the first solution (alternatives included K-NN, PCA + quadratic, and SVM), but it was the first to reach a 0.7% error rate.
  • The LeNet-5 architecture includes an input layer of size 32x32, followed by convolutional layers, subsampling layers, and fully connected layers.
  • It comprises two convolutional layers, two fully connected hidden layers, and one fully connected output layer, and is trained with an MSE loss.
  • Max pooling is used to downsample twice.
  • Output has 10 dimensions, one for each digit.
  • It has approximately 60,000 parameters.
  • Uses a scaled hyperbolic tangent as the activation function: g(x) = A tanh(Sx).
  • Not all feature maps are connected between layers; this sparse connectivity lowers the computational cost and breaks symmetry.
  • Modern implementations usually ignore this and connect all feature maps, mainly for ease of implementation.
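
The parameter count can be sanity-checked with a short calculation. The sketch below assumes the commonly cited LeNet-5 layout (5 x 5 kernels, 6 and 16 feature maps, fully connected layers of 120, 84, and 10 units) and full connectivity between feature maps; none of these sizes are stated in the notes, and the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus one bias per output feature map."""
    return out_ch * (in_ch * k * k) + out_ch


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def lenet5_param_count():
    size = 32                       # 32x32 input, as in the notes
    total = conv_params(1, 6, 5)    # assumed: 6 feature maps, 5x5 kernels
    size = (size - 5 + 1) // 2      # unpadded conv, then 2x2 downsampling -> 14
    total += conv_params(6, 16, 5)  # assumed: 16 maps, connected to all 6 inputs
    size = (size - 5 + 1) // 2      # -> 5
    flat = 16 * size * size         # 400 features
    total += fc_params(flat, 120)   # assumed hidden sizes: 120 and 84
    total += fc_params(120, 84)
    total += fc_params(84, 10)      # 10 outputs, one per digit
    return total


print(lenet5_param_count())  # 61706
```

Under these assumptions the total is consistent with the figure of roughly 60,000 parameters; sparse connectivity between feature maps would lower the share of the second convolutional layer slightly.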

Max Pooling

  • Max Pooling is used to reduce resolution, leading to less memory usage and faster processing.
  • It propagates the most "activated" features and discards weaker, noisier responses.
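
A minimal sketch of max pooling on a single 2-D feature map, assuming a 2 x 2 window with stride 2 (the notes do not state the window size) and a plain list-of-rows representation:

```python
def max_pool_2d(feature_map, window=2, stride=2):
    """Max-pool a 2-D feature map given as a list of equal-length rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            patch = [feature_map[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            pooled_row.append(max(patch))   # keep the strongest activation
        pooled.append(pooled_row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
print(max_pool_2d(fmap))  # [[4, 5], [3, 7]]
```

The 4 x 4 map shrinks to 2 x 2, so later layers process a quarter of the values.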

Tanh Nonlinearity

  • The sigmoid function maps inputs to the range (0, 1): s(x) = 1 / (1 + e^(-x)).
  • The hyperbolic tangent function maps inputs to the range (-1, 1): tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
  • tanh is a scaled and shifted sigmoid: tanh(x) = 2s(2x) - 1.
  • The hyperbolic tangent is slightly preferred over the sigmoid because it is symmetric around 0.
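
The identity tanh(x) = 2s(2x) - 1 can be checked numerically. The sketch below also compares the two slopes at 0, using the standard derivative formulas s'(x) = s(x)(1 - s(x)) and tanh'(x) = 1 - tanh(x)^2, which are not given in the notes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    # tanh(x) = 2 * s(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(tanh_via_sigmoid(x), math.tanh(x), abs_tol=1e-12)

# Slopes at 0: the sigmoid is four times flatter than tanh there
print(sigmoid(0.0) * (1.0 - sigmoid(0.0)))  # 0.25
print(1.0 - math.tanh(0.0) ** 2)            # 1.0
```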

ReLU

  • ReLU (Rectified Linear Unit) is defined as ReLU(x) = max(0, x).
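  • It mitigates vanishing gradients: for positive inputs the output equals the input, so the function does not saturate there.
  • Leaky ReLU is defined as LReLU(x) = x if x ≥ 0, and ax if x < 0, where a is a small slope.
  • ELU (Exponential Linear Unit) is defined as ELU(x) = x if x ≥ 0, and a(e^x - 1) if x < 0.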
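
The three definitions translate directly into code. This is a small sketch on scalars; the default values a = 0.01 and a = 1.0 are assumptions chosen for illustration and are not taken from the notes.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, a=0.01):
    # a is the small slope for negative inputs (0.01 is an assumed default)
    return x if x >= 0 else a * x


def elu(x, a=1.0):
    # a scales the negative saturation value (1.0 is an assumed default)
    return x if x >= 0 else a * (math.exp(x) - 1.0)


def relu_grad(x):
    # Gradient is 1 for positive inputs (no saturation) and 0 otherwise
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), elu(x), relu_grad(x))
```

For negative inputs ReLU outputs exactly zero with zero gradient, while Leaky ReLU and ELU keep a small non-zero response, which is what lets a "dead" unit recover.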

AlexNet

  • A much larger network than LeNet-5.
  • Has 60 million parameters.
  • Uses two "parallel" networks for more efficient GPU parallelization.
  • It contains 5 convolutional layers and 3 fully connected layers.
  • Training took 5-6 days on two GTX 580 GPUs.
  • Used the ImageNet training set, comprising 1.2 million color images across 1000 classes.
  • Inputs are sized at 224 x 224 x 3.
  • The top-1 error rate stood at 37.5%, while the top-5 error rate was 17.0%.
  • AlexNet uses the ReLU nonlinearity (activation function).
  • Visualizations of the first-layer filters show smooth, well-converged filters that resemble traditional ones: gradients, checkerboards, and blurs.

Dropout

  • Used in AlexNet to prevent overfitting and greatly improve test set performance.
  • Randomly "drops" connections during training.
  • Uses all connections at inference time (val. and test).
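
A minimal sketch of the train/inference difference. It assumes the "inverted" dropout variant, which rescales the surviving values during training so that nothing needs to change at inference time, and it drops unit activations rather than individual weights; both choices are assumptions, not details from the notes.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout over a list of activations."""
    if not training or p_drop == 0.0:
        return list(activations)       # inference: everything is kept
    keep = 1.0 - p_drop
    out = []
    for a in activations:
        if rng.random() < keep:
            out.append(a / keep)       # rescale so the expected value is unchanged
        else:
            out.append(0.0)            # dropped for this training step
    return out


rng = random.Random(0)                 # fixed seed for a repeatable demo
x = [1.0, 2.0, 3.0, 4.0]
print(dropout(x, p_drop=0.5, training=True, rng=rng))
print(dropout(x, training=False))      # [1.0, 2.0, 3.0, 4.0]
```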

ILSVRC Performance

  • The lesson charts classification error and object detection average precision in the ILSVRC competition over the years.

VGG

  • Developed by Visual Geometry Group, University of Oxford.
  • A deeper architecture than AlexNet.
  • Has 16-19 layers.
  • Was a winner of the 2014 ImageNet Challenge (first place in localization, second place in classification).
  • It uses smaller convolutional filters (3x3).
  • The learned representations generalize well.
  • Has multiple configurations, which range from 133 to 144 million parameters.
  • The architecture incorporates ReLU activation and is trained using SGD with momentum.
  • Dropout is used to avoid overfitting, and Glorot initialization is used for weight initialization.
  • Weight decay (L2 regularization) is also applied.
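
Why small filters save parameters can be seen with a little arithmetic: a stack of 3 x 3 convolutions covers the same receptive field as one larger filter but with fewer weights. The sketch below assumes stride 1, equal input and output channel counts, and ignores biases; the channel count of 64 is an illustrative assumption.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights (ignoring biases) of `layers` stacked kernel x kernel
    convolutions that keep `channels` input and output channels."""
    return layers * kernel * kernel * channels * channels


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


C = 64  # assumed channel count, for illustration only
print(stacked_receptive_field(3, 2), conv_weights(3, C, layers=2), conv_weights(5, C))
# 5 73728 102400
print(stacked_receptive_field(3, 3), conv_weights(3, C, layers=3), conv_weights(7, C))
# 7 110592 200704
```

Two 3 x 3 layers see a 5 x 5 region with 18C² weights instead of 25C², and three see a 7 x 7 region with 27C² instead of 49C², while also adding extra nonlinearities between the layers.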

Deep Neural Networks

  • Deep networks are challenging to train because the inputs each layer receives from prior layers change after every weight update.
  • Techniques that help include Residual Networks and Batch Normalisation.

Batch Normalisation

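  • Motivated by the significance of network depth, which raises the question of whether better networks can be learned simply by stacking more layers.
  • An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hampers convergence.
  • Batch Normalisation allows SGD to converge and works well for networks with tens of layers.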
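
A minimal sketch of the Batch Normalisation computation from the questions above, applied to one feature over a mini-batch, with learnable scale γ and shift β passed as plain arguments. It covers only the training-time forward pass; the running statistics used at inference are omitted, and eps = 1e-5 is an assumed default.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n                            # batch mean
    var = sum((x - mu) ** 2 for x in batch) / n    # batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


x = [2.0, 4.0, 6.0, 8.0]
y = batch_norm(x)
print(sum(y) / len(y))                             # close to 0
print(batch_norm(x, gamma=2.0, beta=1.0))          # rescaled and shifted
```

With γ = 1 and β = 0 the output has roughly zero mean and unit variance; learning γ and β lets the network undo the normalisation where that helps.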

Residual Networks

  • Introduces "residual" connections: l_res = x + F(x).
  • Winner of the 2015 ILSVRC (as an ensemble).
  • Easier to train.
  • Achieves better propagation of gradients.
  • Allows much deeper networks, with configurations such as 152 layers for ImageNet and up to 1000 layers for CIFAR-10.
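
A small sketch of a residual connection on plain vectors. The two branch functions are hypothetical stand-ins for the learned layers F; the point is that when F outputs zeros the block reduces to the identity, so adding such a block cannot make the representation worse.

```python
def residual_block(x, f):
    """Return x + F(x) for a vector x given as a list of floats."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    # A residual branch whose weights are all zero outputs zeros
    return [0.0 for _ in x]


def small_branch(x):
    # Hypothetical learned branch: a small correction to the input
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))   # [1.0, -2.0, 3.0]  (identity)
print(residual_block(x, small_branch))
```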

Autoencoders

  • Map input to a different space and back.
  • Undercomplete autoencoders: the encoding has fewer dimensions than the input; used for dimensionality reduction and compression.
  • Overcomplete autoencoders: the encoding has the same number of dimensions as the input (or more); used for feature learning and for denoising images / signals.
  • The encoding is y = f(x) and the reconstruction is x̂ = g(y) = g(f(x)).
  • The ideal autoencoder model balances sensitivity to inputs for accurate reconstruction with insensitivity to avoid memorization/overfitting of the training data.
  • Optimise a reconstruction loss L(x, x̂) together with a regulariser that discourages memorisation.
  • In this way the model learns the key attributes of the data, i.e. its most important features.
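
A toy sketch of an undercomplete autoencoder with linear encoder f and decoder g. The weights are fixed by hand rather than learned, the reconstruction loss is assumed to be mean squared error, and the regulariser is assumed to be an L2 weight penalty with strength `lam`; all of these are illustrative assumptions.

```python
def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def encode(x, w_enc):
    return matvec(w_enc, x)            # y = f(x)


def decode(y, w_dec):
    return matvec(w_dec, y)            # x_hat = g(y)


def loss(x, w_enc, w_dec, lam=0.0):
    """Mean squared reconstruction error plus an L2 weight penalty."""
    x_hat = decode(encode(x, w_enc), w_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l2 = sum(w * w for m in (w_enc, w_dec) for row in m for w in row)
    return mse + lam * l2


# Hypothetical fixed weights: keep the first two of three input dimensions
w_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
w_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]

x = [1.0, 2.0, 3.0]
print(encode(x, w_enc))        # [1.0, 2.0]
print(loss(x, w_enc, w_dec))   # 3.0
```

The 2-D code cannot carry the third input dimension, so that part of the input is lost in reconstruction; training would choose the weights that lose the least.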

U-Nets

  • Autoencoders with skip connections.
  • Feature maps from the encoding path are concatenated with those of the decoding path.
  • As a result, local detail is better preserved.
  • Initially employed for biomedical image segmentation.
  • Extensively adaptable to various imaging problems.
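
A minimal sketch of one skip connection, with feature maps held as lists of 2-D channels. It assumes nearest-neighbour 2x upsampling on the decoding path (the notes do not say how the decoder upsamples) and hypothetical helper names.

```python
def upsample2x(channel):
    """Nearest-neighbour 2x upsampling of one 2-D channel."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def skip_concat(encoder_maps, decoder_maps):
    """Upsample decoder channels and concatenate encoder channels to them."""
    upsampled = [upsample2x(ch) for ch in decoder_maps]
    return encoder_maps + upsampled    # concatenation along the channel axis


encoder = [[[1, 2], [3, 4]]]           # 1 channel, 2x2 (fine detail)
decoder = [[[9]], [[8]]]               # 2 channels, 1x1 (coarse features)
merged = skip_concat(encoder, decoder)
print(len(merged))                     # 3
print(merged[1])                       # [[9, 9], [9, 9]]
```

The upsampled decoder channels are flat within each 2 x 2 block, while the concatenated encoder channel still holds the per-pixel values, which is where the preserved local detail comes from.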

R-CNN Family

  • The task is object detection: localizing items with a bounding box or segmentation, and classifying them.
  • Used on datasets such as PASCAL VOC, MS COCO, Cityscapes, and KITTI.

R-CNN

  • Finds Regions of Interest (RoIs) first: naively "scanning" a Full HD image with a CNN could take up to 2,073,600 evaluations (1920 x 1080, one per pixel position).
  • These regions are then classified using a CNN and an SVM; this method is R-CNN.
  • Region proposals are not generated by a CNN. They come from algorithms such as objectness, selective search, and category-independent object proposals.
  • It uses a CNN to extract image features.
  • It finds possible objects and then scales them to a consistent aspect ratio.
  • A CNN is then used to classify the objects.

IoU: Intersection over Union

  • This helps to compare bounding boxes.
  • IoU indicates how much the predicted bounding box overlaps the ground truth bounding box: the area of intersection divided by the area of union.
  • A value of 1 means a perfect prediction; 0 means no overlap at all.
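
A minimal sketch of the IoU computation for axis-aligned boxes. The (x_min, y_min, x_max, y_max) box format is an assumption; the notes do not fix a representation.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 2, 2)
print(iou(truth, (0, 0, 2, 2)))   # 1.0
print(iou(truth, (1, 0, 3, 2)))   # 0.3333333333333333
print(iou(truth, (5, 5, 6, 6)))   # 0.0
```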

NMS: Non-Maximum Suppression

  • Used when the predicted regions often overlap.
  • R-CNN predicts classifications for 2000 RoIs.
  • Discards the boxes with low classification confidence.
  • Greedily selects the remaining boxes in order of score, suppressing boxes that overlap an already selected one.
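
The two steps above can be sketched as a greedy loop. The threshold values and the (x_min, y_min, x_max, y_max) box format are assumptions for illustration; an IoU helper is included so the block runs on its own.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.3, iou_thresh=0.5):
    """Return indices of the boxes kept, highest score first."""
    # Step 1: discard boxes with low classification confidence
    order = [i for i in sorted(range(len(boxes)), key=lambda i: -scores[i])
             if scores[i] >= score_thresh]
    kept = []
    # Step 2: greedily keep the best remaining box, drop its heavy overlaps
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 9, 9)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

Box 3 is removed by the confidence threshold, and box 1 is suppressed because it overlaps the higher-scoring box 0 with an IoU of about 0.68.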

Problems with R-CNN

  • R-CNN takes a long time to train, because thousands of region proposals have to be processed for every image.
  • The selective search algorithm is fixed (not trainable), so the proposal stage cannot be improved by learning.
  • It cannot be run in real time.
  • Testing takes around 47 seconds per image.

Fast R-CNN

  • Takes an entire image and a set of object proposals as input.
  • For each object proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
  • Training is end to end, as the whole model is differentiable.
  • Training is faster than with R-CNN (8.75 hours vs. 84 hours), as is inference (0.32 s vs. 47 s per image).
  • Improved precision.
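
A simplified sketch of RoI max pooling on a single-channel feature map, showing how regions of different sizes yield outputs of the same fixed size. It assumes the RoI is already given in whole feature-map cells and uses floor/ceil bin edges; real implementations handle coordinate scaling and multiple channels, which are left out here.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the cells inside `roi` into an out_size x out_size grid.

    `roi` is (x0, y0, x1, y1) in feature-map cells, end-exclusive.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0

    def bin_edges(i, length):
        start = (i * length) // out_size              # floor
        end = -((-(i + 1) * length) // out_size)      # ceil
        return start, end

    pooled = []
    for i in range(out_size):
        r0, r1 = bin_edges(i, h)
        row = []
        for j in range(out_size):
            c0, c1 = bin_edges(j, w)
            row.append(max(feature_map[y0 + r][x0 + c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[4 * r + c for c in range(4)] for r in range(4)]   # values 0..15
print(roi_max_pool(fmap, (0, 0, 4, 4)))   # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 4, 4)))   # [[10, 11], [14, 15]]
```

The 4 x 4 and 3 x 3 regions both produce a 2 x 2 output, so the layers that follow always receive a vector of the same length.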

Faster R-CNN

  • Introduces the Region Proposal Network.
  • Replaces Region of Interest (RoI) proposal algorithms with a CNN.
  • It is a trainable proposal method that uses a binary classifier to predict whether an object is present or not.

Evaluation

  • Set two thresholds: a confidence threshold (how sure the model is that it has classified the object or region correctly) and an IoU (Intersection over Union) threshold (whether the object is localized well).
  • Models are then compared in terms of precision and recall at the chosen IoU threshold.
  • Precision = Good Predictions / All Predictions
  • Recall = Good Predictions / All Targets
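
A minimal sketch of the two-threshold evaluation. Each detection is reduced to a pair (confidence, best IoU with a ground truth box); the threshold values are assumptions, and the sketch assumes every ground truth object is matched by at most one detection, which a full evaluation would have to enforce.

```python
def precision_recall(detections, num_targets, conf_thresh=0.5, iou_thresh=0.5):
    """detections: list of (confidence, best IoU with a ground-truth box)."""
    # Keep only predictions the model is confident about
    kept = [d for d in detections if d[0] >= conf_thresh]
    # A kept prediction is "good" if it is localized well enough
    true_pos = sum(1 for _, overlap in kept if overlap > iou_thresh)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / num_targets if num_targets else 0.0
    return precision, recall


detections = [(0.9, 0.8), (0.8, 0.3), (0.6, 0.7), (0.2, 0.9)]
print(precision_recall(detections, num_targets=4))  # (0.6666666666666666, 0.5)
```

Lowering the confidence threshold to 0.1 would also keep the fourth detection, raising both precision and recall to 0.75 in this example.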

YOLO

  • Stands for You Only Look Once.
  • Is lightweight and very fast.
  • Has lower accuracy.

Other CNN Detectors

  • Include the Single Shot Detector (SSD), a one-stage detector similar to YOLO.

Description

Explore CNN architectures, datasets like MNIST, Fashion MNIST, CIFAR & ImageNet. Discusses LeNet-5, AlexNet, VGG, Residual Networks, Autoencoders/U-Net, Detectors, and Information Theory. Understand the structure and applications of each dataset.
