Untitled Quiz
50 Questions

Questions and Answers

What happens when both 𝑛 and 𝑑 are large in the context of feature mapping?

  • 𝜙(𝒙) becomes very large and expensive to deal with (correct)
  • 𝜙(𝒙) becomes irrelevant
  • 𝜙(𝒙) remains small and manageable
  • 𝜙(𝒙) becomes linear

    The product of two valid kernels is not a valid kernel.

    False (B)

    What is the main purpose of using a feature map 𝜙(𝑥)?

    To transform input data into a higher-dimensional space.

    Kernel methods may suffer from the ___________ when dealing with very high-dimensional spaces.

    Curse of Dimensionality

    Match the following statements with their corresponding descriptions:

    • Additivity = The sum of two valid kernels is also a valid kernel.
    • Scalar Multiplication = A valid kernel multiplied by a positive scalar is still a valid kernel.
    • Exponentiation = Raising a valid kernel to a positive integer power results in another valid kernel.
    • Complex Patterns = Kernel methods might struggle to capture intricate relationships in data.

    Which of the following is a limitation of Kernel Least Square methods?

    Overfitting can occur, especially in noisy datasets (A)

    In linear least squares, non-linear relationships can be directly modeled without transformation.

    False (B)

    What is one approach to tackle non-linear relationships in Kernel Least Squares?

    Mapping input data into a higher-dimensional space.

    What approach is typically chosen for determining the optimal order of testing features in a decision tree?

    Greedy approach (D)

    The optimal order of testing features in a decision tree can always be found efficiently.

    False (B)

    What is measured to determine the information content of a feature in a decision tree?

    Reduction in uncertainty

    In a decision tree, we choose the feature that has the highest __________ content.

    information

    What is the entropy of the distribution (0.01,0.99)?

    0.081 (A)

    Testing a feature in a decision tree reduces the uncertainty and provides useful information.

    True (A)

    Which mathematical expression calculates the entropy of a distribution?

    −∑ P(c_i) log2(P(c_i))

    Match the decision tree terms with their descriptions:

    • Greedy approach = Makes the best local choice at each step
    • Entropy = Measure of uncertainty in a distribution
    • Information gain = Reduction in uncertainty from a feature
    • Feature selection = Choosing the best feature to test first

    What is the primary function of Max Pooling in convolutional neural networks?

    Reduce dimensionality (B)

    ReLU is a linear activation function used in convolutional layers.

    False (B)

    What is the significance of learning convolutional filters from the bottom up?

    It allows the model to learn more features and adapt to various object orientations and colors.

    AlexNet has approximately _____ million parameters.

    60

    Which of the following operations is performed before applying the convolutional layers in AlexNet?

    Max Pooling (A), Normalization (C)

    Match the layers of AlexNet with their descriptions:

    • CONV1 = First layer that applies 11x11 filters
    • MAX POOL1 = Reduces size after CONV1
    • FC6 = Fully connected layer with many parameters
    • NORM1 = Normalizes activations after CONV1

    The ImageNet challenge significantly advanced deep learning due to large datasets and GPU utilization.

    True (A)

    What is the output volume size after applying the CONV1 layer in AlexNet?

    55x55x96

    What is the main characteristic of the k-nearest neighbors (k-NN) algorithm?

    It uses instance-based learning. (B)

    K-NN algorithm does not make predictions based on the proximity of new instances to existing ones.

    False (B)

    What metric is commonly used to measure distance in the k-NN algorithm?

    Euclidean Distance

    In the k-NN classification, the query point's class label is determined by finding the __________ of the class labels among its k-nearest neighbors.

    mode

    What strategy can be used to resolve ties in k-NN classification?

    Randomly select one of the tied classes. (B)

    Match the following distance metrics with their appropriate definitions:

    • Euclidean Distance = Measures the straight-line distance between two points
    • Manhattan Distance = Measures the sum of absolute differences between coordinates

    The k-NN algorithm only works for classification tasks and cannot be used for regression tasks.

    False (B)

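    If a query point has neighbors' class labels [0, 1, 1, 0], what is the predicted class?

    Tie: classes 0 and 1 each appear twice, so the prediction depends on the tie-breaking rule. The answer key gives 1, which is one possible outcome of random tie-breaking.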

    What is the main idea behind Recurrent Neural Networks (RNNs)?

    To capture information about the past using hidden states (D)

    In a feedforward network, each layer receives input from both the previous layer and the output from the previous time step.

    False (B)

    What do all layers in an RNN share?

    The same model parameters (U, V, W)

    An RNN layer captures information about the past using its hidden __________.

    state

    Match the following components with their characteristics in RNNs:

    • Hidden State = Captures information about previous time steps
    • Input Layer = Receives data inputs for processing
    • Time Steps = Sequential processing of data
    • Output Layer = Produces final predictions or outputs

    Which of the following describes the flow of information in a feedforward network?

    Data flows linearly from input to output without loops (D)

    RNNs can only process data one time step at a time.

    True (A)

    What is a key difference in how the layers of RNNs function compared to traditional feedforward networks?

    RNN layers take input from both the previous layer and the previous time step.

    Which of the following optimizers is NOT mentioned for updating weights and biases in a feedforward neural network?

    Newton's Method (C)

    Backpropagation is used to minimize the loss function in neural networks.

    True (A)

    What is represented by the variable 'l' in the context of a feedforward neural network?

    loss function

    The activation of the first layer is calculated using the formula 𝑎1 = 𝑊1𝑋 + ______.

    𝑏1

    Match the variables with their corresponding meanings:

    • 𝑊 = Weights of the network
    • 𝑏 = Bias term
    • 𝑎 = Activation values
    • 𝑜 = Output values

    In the context of backpropagation, what does the variable '𝜂' typically represent?

    Learning rate (B)

    The output for a given layer is computed using the same weights as the previous layer.

    False (B)

    What mathematical operation is performed when updating the weights using backpropagation?

    Subtraction of the gradient multiplied by the learning rate

    The formula for updating weights involves the gradient of the loss function with respect to ______.

    weights

    Which of the following best describes the purpose of the activation function in a neural network?

    To introduce non-linearity into the model (C)

    Study Notes

    Supervised Learning

    • Supervised learning involves a mapping function from input features (X) to output labels (Y)
    • Classification: Y is discrete, e.g., Y ∈ {1, 2, ..., k}
    • Regression: Y is continuous, e.g., Y ∈ R
    • Linear Separability: Ability to separate two classes of data points using a single hyperplane in a feature space.
    • A dataset is linearly separable if a hyperplane exists that places all data points of one class on one side and the other class on the opposite side.
    • w is the weight vector (defining the hyperplane's orientation).
    • x is the input feature vector.
    • b is the bias term (defining the hyperplane's position).
    • For a dataset with labels y_i ∈ {−1, +1} to be linearly separable, there must exist a weight vector w and a bias b that satisfy: y_i (w•x_i + b) > 0 for all i.
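The separability condition can be checked directly for a candidate hyperplane. The following is a minimal Python sketch, assuming labels in {-1, +1}; the helper name `separates` and the toy AND/XOR points are illustrative and do not come from the notes.

```python
def separates(w, b, points, labels):
    """Return True if y_i * (w . x_i + b) > 0 for every labelled point."""
    return all(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) > 0
        for x, y in zip(points, labels)
    )

# Hypothetical toy data on the unit square, labels in {-1, +1}.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # linearly separable
xor_labels = [-1, 1, 1, -1]    # not linearly separable

w, b = (1.0, 1.0), -1.5        # one candidate hyperplane
print(separates(w, b, points, and_labels))
print(separates(w, b, points, xor_labels))
```

A `True` result proves separability for that labelling; a `False` result only rules out this particular (w, b), not every hyperplane.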

    Linear Classification

    • Perceptron Algorithm: A foundational algorithm for binary classification. It aims to find a linear decision boundary that separates two classes. It iteratively updates weights based on misclassifications.
    • f(x) = sign(w•x + b)
    • w ∈ ℝ^n is the weight vector.
    • b is the bias term.
    • Perceptron Algorithm Update Rule: When (x_i, y_i) is misclassified, update weights and bias: w ← w + η y_i x_i, b ← b + η y_i
    • where η is the learning rate.
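The update rule above can be sketched as a short training loop. This is an illustrative Python version, not a reference implementation: the learning rate is named `eta`, labels are assumed to be in {-1, +1}, and the `max_epochs` cap is an added safeguard because the loop would otherwise never stop on non-separable data.

```python
def train_perceptron(points, labels, eta=1.0, max_epochs=100):
    """Return (w, b, converged) after running the perceptron update rule."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or exactly on the boundary)
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:            # a full pass without errors: done
            return w, b, True
    return w, b, False               # gave up: data may not be separable

# Hypothetical toy data, labels in {-1, +1}.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]

print(train_perceptron(points, and_labels))
print(train_perceptron(points, xor_labels)[2])
```

On the AND labels the loop stops with a separating hyperplane; on XOR it exhausts `max_epochs` and reports `False`.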

    Linear Classification (SVM)

    • Support Vector Machine (SVM): Aims to find the maximum margin hyperplane that separates two classes with the largest possible margin.
    • The margin is the distance between the decision boundary and the closest data points from either class.
    • For correctly classified points, yi(w•xi + b) ≥ 1, for all i.
    • The goal is to maximize the margin, which is equivalent to minimizing ||w||^2 / 2.
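To make the margin concrete, the sketch below computes the geometric margin min_i y_i (w•x_i + b) / ||w|| for two candidate hyperplanes. It does not solve the SVM optimization problem; the function name and the one-dimensional toy data are assumptions chosen for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two classes along the first axis.
points = [(0, 0), (1, 0), (3, 0), (4, 0)]
labels = [-1, -1, 1, 1]

centered = geometric_margin((1.0, 0.0), -2.0, points, labels)   # boundary at x1 = 2
off_center = geometric_margin((1.0, 0.0), -1.5, points, labels) # boundary at x1 = 1.5
print(centered, off_center)
```

Both hyperplanes separate the data, but the centered one has the larger margin, which is the one an SVM would prefer. For it, y_i (w•x_i + b) ≥ 1 holds with equality at the closest points, and the margin equals 1 / ||w||.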

    Linear Regression

    • Linear regression models the relationship between input features (X ∈ ℝ^(n×p)) and target values (y ∈ ℝ^n) as: y = Xβ + e.
    • X is the design matrix of input features.
    • β is the vector of coefficients.
    • e is the error term.
    • To estimate β, minimize the sum of squared residuals: min_β ||y − Xβ||^2 = min_β Σ_{i=1..n} (y_i − x_i^T β)^2, where x_i^T is the i-th row of X.
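For a single feature plus an intercept, the least-squares minimizer has a simple closed form. The sketch below uses that special case so it needs no linear-algebra library; the function names and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy observations of roughly y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
print(sum_squared_residuals(xs, ys, intercept, slope))
```

Any other (intercept, slope) pair gives a sum of squared residuals at least as large as the fitted one.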

    Weighted Least Square

    • In WLS, observations are allowed to have non-constant variance (heteroscedasticity). Each observation gets its own weight w_i, with higher-variance observations receiving lower weight, and the objective becomes min_β Σ_i w_i (y_i − x_i^T β)^2.
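A small sketch of the weighting idea, again for one feature plus an intercept. It assumes the per-observation variances are known and uses weights 1 / variance, which is a common choice but not one the notes spell out; the data is hypothetical.

```python
def fit_line_weighted(xs, ys, weights):
    """Minimise sum_i w_i * (y_i - intercept - slope * x_i)^2."""
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    s_xy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope

# Hypothetical data near y = 1 + 2x; the last observation is very noisy.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 20.0]
variances = [1.0, 1.0, 1.0, 100.0]   # assumed known for this sketch
weights = [1.0 / v for v in variances]

_, ols_slope = fit_line_weighted(xs, ys, [1.0] * len(xs))  # equal weights = OLS
_, wls_slope = fit_line_weighted(xs, ys, weights)
print(ols_slope, wls_slope)
```

With equal weights the noisy point drags the slope far from 2; down-weighting it brings the estimate back close to 2.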

    Ridge Regression

    • In OLS, we aim to minimize the residual sum of squares: minβ ||y – Xβ||^2.
    • Multicollinearity can cause instability in the inverse (X^T X)^-1, making OLS unreliable.
    • Ridge Regression addresses multicollinearity by adding a regularization term that penalizes large values of β: minβ ||y – Xβ||^2 + λ||β||^2.
    • λ ≥ 0 is the regularization parameter that controls the amount of regularization applied.
    • λ = 0 gives the OLS solution.
    • λ > 0 penalizes large coefficients, reducing overfitting and addressing multicollinearity.
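The effect of λ on a multicollinear problem can be seen with two identical feature columns. The sketch below solves (X^T X + λI) β = X^T y for exactly two features using Cramer's rule; the function name `ridge_2d` and the data are illustrative assumptions.

```python
def ridge_2d(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for a design matrix with two columns."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X + lam * I is singular")
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

# Hypothetical data: both columns are identical (perfect multicollinearity).
X = [[1, 1], [2, 2], [3, 3]]
y = [2, 4, 6]

try:
    ridge_2d(X, y, lam=0)          # plain OLS: X^T X cannot be inverted
except ValueError as err:
    print("OLS failed:", err)

print(ridge_2d(X, y, lam=1.0))     # ridge: a unique, shrunken solution
```

With λ = 0 the system has no unique solution; any λ > 0 makes the matrix invertible and splits the effect evenly between the two duplicated features.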

    Limitations of the Perceptron

    • Requirement: Linear Separability: The perceptron algorithm only converges if the data is linearly separable.
    • Non-linear Separability: Fails on datasets (like XOR) where no linear hyperplane can separate the classes. (May run indefinitely)
    • No Margin Optimization: The perceptron does not seek the hyperplane with the largest margin, potentially leading to poor generalization.

    Kernel Functions

    • Additivity: The sum of two valid kernels is also a valid kernel.
    • Scalar Multiplication: The product of a valid kernel and a positive scalar is also a valid kernel.
    • Product of Kernels: The product of two valid kernels is also a valid kernel.
    • Exponentiation: Raising a valid kernel to a positive integer power yields another valid kernel (it is a repeated product of the kernel with itself).
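One way to see why sums and products stay valid is to write down an explicit feature map whose inner product reproduces the combined kernel. The sketch below does this for two-dimensional inputs, starting from the linear kernel; all function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_linear(x, z):
    return dot(x, z)

def k_product(x, z):
    # Product of two valid kernels (here: linear times linear).
    return k_linear(x, z) * k_linear(x, z)

def k_sum(x, z):
    # Sum of two valid kernels.
    return k_linear(x, z) + k_product(x, z)

def phi_product(x):
    # Explicit feature map for (x . z)^2 with two-dimensional inputs.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def phi_sum(x):
    # Concatenating feature maps corresponds to adding kernels.
    return tuple(x) + phi_product(x)

x, z = (1.0, 2.0), (3.0, -1.0)
print(k_product(x, z), dot(phi_product(x), phi_product(z)))
print(k_sum(x, z), dot(phi_sum(x), phi_sum(z)))
```

Each pair of printed values agrees up to rounding, so both combined kernels are inner products in some feature space.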

    Kernel Least Squares

    • In linear least squares, the model is limited to linear relationships between input features and outputs. To handle non-linear relationships, a feature map (φ(x)) maps the input data into a higher-dimensional space.
    • Least squares is then solved in the feature space: min_β ||y − Φ(X)β||^2, where row i of Φ(X) is φ(x_i).
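A minimal sketch of least squares in feature space, using an explicit feature map rather than the kernel trick. The quadratic map φ(x) = (1, x, x²), the small Gaussian-elimination helper, and the data are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def phi(x):
    # Hypothetical feature map: constant, linear and quadratic terms.
    return [1.0, x, x * x]

def fit(xs, ys):
    """Minimise ||y - Phi beta||^2 via the normal equations Phi^T Phi beta = Phi^T y."""
    Phi = [phi(x) for x in xs]
    p = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(p)] for i in range(p)]
    g = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(p)]
    return solve(A, g)

# Data generated from y = x^2 + 1, which no straight line can fit.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [5.0, 2.0, 1.0, 2.0, 5.0]
beta = fit(xs, ys)
print(beta)
```

The model is still linear in β, but because it is fitted on φ(x) it recovers the non-linear relationship in x.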

    k-Nearest Neighbor (k-NN)

    • k-NN is an intuitive, simple, but powerful non-parametric algorithm for both classification and regression tasks.
    • k-NN stores the entire training dataset and makes predictions for new data points by comparing them to stored instances. No explicit training occurs.
    • k-NN predictions are based on the proximity of new instances to existing ones.

    k-Nearest Neighbor - Classification

    • In classification, k-NN assigns a class label to the query point (xq) based on the most frequent label (the mode) among its k-nearest neighbors.

    k-Nearest Neighbor - Regression

    • In regression, k-NN predicts the output based on the average of the target values of the k-nearest neighbors for a query point (xq).
    • To improve performance, weighted k-NN may be used where each neighbor's influence is weighted by its distance from the query point.
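Both prediction modes can be sketched in a few lines. The version below assumes Euclidean distance and unweighted neighbors, and breaks classification ties in favor of the nearest tied neighbor (random selection is another option); the function names and training data are illustrative.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    labels = [label for _, label in k_nearest(train, query, k)]
    counts = Counter(labels)
    best = max(counts.values())
    # Tie-break: the tied label that occurs first belongs to the nearest neighbor.
    return next(label for label in labels if counts[label] == best)

def knn_regress(train, query, k):
    targets = [t for _, t in k_nearest(train, query, k)]
    return sum(targets) / len(targets)

# Hypothetical training sets.
train_cls = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
train_reg = [((1,), 1.0), ((2,), 2.0), ((3,), 3.0), ((10,), 10.0)]

print(knn_classify(train_cls, (0.5, 0.5), k=3))
print(knn_classify(train_cls, (5.2, 5.1), k=3))
print(knn_regress(train_reg, (2.1,), k=3))
```

There is no training step: all the work happens at query time, when distances to the stored instances are computed.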

    Decision Tree

    • A decision tree is a simple supervised classification model used to classify a single discrete target feature.
    • Each internal node performs a Boolean test on an input feature. The edges are labeled with the values of that input feature.
    • Each leaf node specifies a value for the target feature.

    Decision Tree - Splitting Criteria

    • Information Gain: measures the reduction in uncertainty after testing a feature. High gain is preferred.
    • Gini Index: measures the probability of misclassifying a randomly chosen element from a node. A low index is preferred. It is efficient to compute and is preferred for imbalanced datasets.
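Both criteria follow from the class distribution at a node. The sketch below computes entropy, −Σ P(c_i) log2 P(c_i), information gain as parent entropy minus the weighted entropy of the children, and the Gini index as 1 − Σ P(c_i)². Function names and the example labels are illustrative.

```python
import math
from collections import Counter

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def distribution(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in distribution(labels))

def information_gain(labels, feature_values):
    """Reduction in entropy after splitting the labels by a feature's values."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(distribution(g)) for g in groups.values())
    return entropy(distribution(labels)) - remainder

print(entropy([0.01, 0.99]))   # close to 0: almost no uncertainty
print(entropy([0.5, 0.5]))     # maximal uncertainty for two classes

labels = ["yes", "yes", "no", "no"]
print(information_gain(labels, ["a", "a", "b", "b"]))  # feature separates the classes
print(information_gain(labels, ["a", "b", "a", "b"]))  # feature tells us nothing
print(gini(labels))
```

A greedy tree builder would test the feature with the highest information gain (or the lowest weighted Gini index) first.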

    Pre-Pruning (Decision Tree)

    • Stop tree growth early, based on one or more of the following criteria, to prevent overfitting.
    • Maximum Depth: Predefined maximum depth for the tree.
    • Minimum Examples at a Leaf: Predefined minimum number of examples needed at a leaf node.
    • Minimal Information Gain: Splitting is only done if the associated information gain exceeds a pre-defined threshold value (measure of improvement).
    • Reduction in Training Error: Stop splitting if the reduction in training error is below a pre-defined threshold.

    Post-Pruning (Decision Tree)

    • Grow the full tree first and trim it afterwards.
    • Restrict attention to nodes that only have leaf nodes as descendants.
    • If the expected information gain at such a node is below a threshold, delete its children and make a majority decision at that node.

    Feedforward Neural Network (FNN)

    • Neural networks map input features (X ∈ ℝ^n) to output labels (Y).
    • Classification: Y is discrete (e.g., Y ∈ {1, 2, ..., k}).
    • Regression: Y is continuous (e.g., Y ∈ R).
    • Key components: Layers (depth), Width (number of neurons per layer), Activation function, Loss function (e.g., Mean Squared Error [MSE], Cross-entropy).

    Feedforward Neural Network - Backpropagation

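    • Backpropagation computes the gradient of the loss l with respect to every weight and bias. Gradient descent then updates them iteratively by subtracting the gradient multiplied by the learning rate η, e.g. W ← W − η ∂l/∂W.
    • Optimizers such as Stochastic Gradient Descent (SGD), SGD with momentum, and Adam are used.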
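The update loop can be sketched for the smallest possible network: one input, one tanh hidden unit, one linear output, and a squared-error loss. The architecture, initial parameters, data, and learning rate below are assumptions chosen to keep the chain rule readable.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    a1 = w1 * x + b1          # pre-activation of the hidden unit
    h = math.tanh(a1)         # non-linear activation
    o = w2 * h + b2           # network output
    return a1, h, o

def loss(params, x, y):
    return 0.5 * (forward(params, x)[2] - y) ** 2

def gradients(params, x, y):
    """Backpropagation: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    a1, h, o = forward(params, x)
    d_o = o - y                       # dl/do
    d_w2, d_b2 = d_o * h, d_o
    d_a1 = d_o * w2 * (1.0 - h * h)   # tanh'(a1) = 1 - tanh(a1)^2
    d_w1, d_b1 = d_a1 * x, d_a1
    return [d_w1, d_b1, d_w2, d_b2]

def sgd(params, data, eta=0.1, epochs=200):
    params = list(params)
    for _ in range(epochs):
        for x, y in data:
            grads = gradients(params, x, y)
            # Update: subtract the gradient multiplied by the learning rate.
            params = [p - eta * g for p, g in zip(params, grads)]
    return params

def total_loss(params, data):
    return sum(loss(params, x, y) for x, y in data)

data = [(-1.0, -0.5), (0.0, 0.0), (1.0, 0.5)]   # hypothetical training pairs
start = [0.5, 0.0, 0.5, 0.0]
trained = sgd(start, data)
print(total_loss(start, data), total_loss(trained, data))
```

Momentum and Adam change how the step is computed from the gradient, but the gradient itself still comes from the same backward pass.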

    Feedforward Neural Network - Activation Functions

    • Activation functions introduce non-linearity, enabling networks to approximate complex relationships. Without them, a stack of layers collapses into a single linear map.
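The sketch below illustrates why the non-linearity matters, using scalar weights for brevity: two layers with no activation in between are exactly one linear layer, while inserting ReLU breaks that equivalence. The function names and parameter values are illustrative.

```python
import math

def relu(a):
    return max(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def two_linear_layers(x, w1, b1, w2, b2):
    # No activation between the layers.
    return w2 * (w1 * x + b1) + b2

def collapsed_layer(x, w1, b1, w2, b2):
    # The same function written as a single linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)

def two_layers_with_relu(x, w1, b1, w2, b2):
    return w2 * relu(w1 * x + b1) + b2

params = (2.0, -1.0, 3.0, 0.5)   # hypothetical w1, b1, w2, b2
for x in (-2.0, 0.0, 3.0):
    print(x, two_linear_layers(x, *params), collapsed_layer(x, *params),
          two_layers_with_relu(x, *params))

# ReLU is not linear: f(a + b) differs from f(a) + f(b) here.
print(relu(-1.0 + 1.0), relu(-1.0) + relu(1.0))
```

The first two columns always agree; the ReLU column differs wherever the hidden pre-activation is negative.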

    CNN

    • CNNs are neural networks with convolutional layers.
    • Convolution layers extract local features.
    • These layers learn filters.
    • Filters are repeatedly slid across the image.
    • Pooling layers reduce dimensionality.
    • CNNs employ a more structured way to process images by using spatial information and sharing weights.
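The sliding-filter and pooling steps can be sketched on a tiny image. The code below uses stride 1 and no padding for the convolution, a fixed hand-written edge filter instead of a learned one, and the usual output-size formula (W − F + 2P) / S + 1. The 227-pixel input and stride 4 used for the AlexNet check are commonly quoted values and are assumptions here, not figures from these notes.

```python
def output_size(width, filter_size, stride, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (width - filter_size + 2 * padding) // stride + 1

def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_filter = [[-1, 1], [-1, 1]]        # responds to dark-to-bright transitions

features = conv2d(image, edge_filter)   # 4x4 feature map
pooled = max_pool(features)             # 2x2 after pooling
print(features)
print(pooled)
print(output_size(227, 11, 4), output_size(55, 3, 2))
```

The same four filter weights are reused at every position, which is the weight sharing mentioned above; pooling then keeps the strongest response in each window.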

    RNN

    • RNNs are neural networks designed for processing sequences.
    • Key idea: use a hidden state to capture information about the past.
    • Layers use shared parameters.
    • "Recurrent" means the same computation is applied at every time step, so each output is influenced by earlier inputs through the hidden state.
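A scalar sketch of the recurrence, with the same three parameters reused at every time step. The roles assigned to U (input to hidden), W (hidden to hidden), and V (hidden to output), the tanh activation, and the numeric values follow a common convention and are assumptions here.

```python
import math

def rnn_forward(inputs, U, W, V, h0=0.0):
    """Run a scalar RNN over a sequence and return (outputs, final hidden state)."""
    h = h0
    outputs = []
    for x in inputs:
        h = math.tanh(U * x + W * h)   # new hidden state mixes input and past state
        outputs.append(V * h)          # output read from the hidden state
    return outputs, h

U, W, V = 1.0, 0.5, 2.0                # shared across all time steps
with_signal, _ = rnn_forward([1.0, 0.0, 0.0], U, W, V)
all_zero, _ = rnn_forward([0.0, 0.0, 0.0], U, W, V)
print(with_signal)
print(all_zero)
```

The last two inputs are zero in both sequences, yet the outputs differ, because the hidden state still carries a trace of the first input.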

    RNN Variants: Different Number of Hidden Layers

    • RNNs are typically deep nets whose layers share weights. In experiments, deeper nets tend to perform better.

    RNN: Vanishing Gradient Problem

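    • Problem: learning long-term dependencies is difficult because of the vanishing gradient problem. The gradient is a product of one factor per time step, and when these factors are smaller than 1 it shrinks towards zero over long sequences.
    • Exploding gradients are the opposite case: factors larger than 1 make the gradient grow rapidly, which destabilizes training.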
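Both effects show up in a scalar RNN, where the gradient of the last hidden state with respect to the first is a product of one factor per time step. The sketch below assumes the recurrence h_t = tanh(U x_t + W h_{t-1}) with zero input and zero initial state, so each factor is simply W; the function name and values are illustrative.

```python
import math

def gradient_through_time(W, steps, h0=0.0, x=0.0, U=1.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(U * x + W * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(U * x + W * h)
        grad *= W * (1.0 - h * h)   # chain rule: one factor per time step
    return grad

vanishing = gradient_through_time(W=0.5, steps=50)
exploding = gradient_through_time(W=1.5, steps=50)
print(vanishing, exploding)
```

With nonzero inputs the tanh derivative is below 1, which shrinks every factor further, so this setting is a best case for the vanishing side.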
