ML & Data Science: Bias-Variance and Imbalanced Data

Questions and Answers

Consider a classification problem with imbalanced data. A classifier that always predicts the majority class achieves 90% accuracy. Which of the following statements is most accurate?

  • The classifier is an acceptable starting point and should be further optimized for better performance.
  • The classifier is not useful because it does not provide any insight into the minority class. (correct)
  • The classifier is performing well, as it achieves high accuracy.
  • The classifier is overfitting to the majority class.

Assume you are building a model to predict fraudulent transactions. The dataset is highly imbalanced, with only 2% of transactions being fraudulent. Which evaluation metric is the most appropriate to use?

  • F1 Score (correct)
  • Precision
  • Accuracy
  • Recall

How does increasing model complexity typically affect bias and variance?

  • Increases both bias and variance
  • Decreases both bias and variance
  • Decreases bias and increases variance (correct)
  • Increases bias and decreases variance

You're using PCA to reduce the dimensionality of your dataset. You notice that the first two principal components explain 95% of the variance. Which of the following is the most reasonable conclusion?

It is safe to reduce the dataset to two dimensions, as very little information is lost.

In PCA, what is the significance of the eigenvectors derived from the covariance matrix of the data?

They define the directions of the new feature vectors, also known as principal components.

Consider a dataset where you want to predict whether a customer will click on an ad (binary classification). You have a Naive Bayes' classifier. Under what condition is the Naive Bayes' assumption most likely to be problematic?

When the features are highly correlated with each other.

In the context of Bayesian learning, differentiate between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

MLE maximizes the probability of the data, while MAP maximizes the posterior probability of the parameters given the data.

Which statement best describes the role of the sigmoid function in logistic regression?

Transforms the linear combination of inputs into a probability between 0 and 1.

How does L1 regularization differ from L2 regularization in the context of linear regression?

L1 regularization encourages sparsity in the model by setting some coefficients to zero, while L2 regularization shrinks coefficients towards zero without necessarily setting them to zero.

What is the primary purpose of dropout in neural networks?

To prevent overfitting by randomly deactivating neurons during training.

In the context of Convolutional Neural Networks (CNNs), what is the role of a pooling layer?

To reduce the dimensionality of the feature maps and provide translation invariance.

A CNN has multiple convolutional layers. How does the complexity of features detected typically change from earlier to later layers?

Earlier layers detect simple edges and textures, while later layers detect complex, high-level features.

In the context of ensemble learning, what is the primary difference between bagging and boosting?

Bagging aims to reduce variance, whereas boosting aims to reduce bias.

What is the role of a meta-learner in a stacking ensemble method?

To combine the predictions of the base learners into a final prediction.

How does the encoder in an autoencoder contribute to data compression?

By mapping the data to a lower-dimensional latent space.

What is the primary difference between an autoencoder (AE) and a variational autoencoder (VAE)?

VAEs learn a continuous latent space, whereas AEs may have a discontinuous latent space.

In data structures, what is the key distinction between a List and a Linked List regarding memory usage and access?

Lists allocate contiguous memory blocks allowing for random access via index, while Linked Lists use non-contiguous memory with sequential access through pointers.

What is the difference between a Stack and a Queue in terms of accessing elements?

Stacks use a Last-In-First-Out (LIFO) approach, while Queues use a First-In-First-Out (FIFO) approach.

In a hash table, what is the main reason for collisions, and how are collisions typically resolved?

Collisions occur when different keys produce the same hash value, typically resolved using separate chaining or open addressing.

In tree data structures, how does a 'leaf node' differ from a 'root node'?

Root nodes do not have parents; leaf nodes do not have children.

How do graphs differ from trees in data structure characteristics?

Graphs can contain cycles; trees are acyclic and have a hierarchical structure with one root.

When preparing for coding interviews, why is it recommended to focus on easy to medium difficulty questions first?

To build a strong foundation and familiarity with fundamental concepts.

During a coding interview, what is the benefit of 'thinking out loud' while coding?

It allows the interviewer to understand your problem-solving approach and thought process.

During a behavioral interview, what is the STAR method primarily used for?

To structure answers to questions about past experiences in a clear and concise manner.

When preparing stories for a behavioral interview, why is it recommended to assign keywords to each story?

To easily recall and match relevant stories to different interview questions based on their themes.

What does the 'Action' component of the STAR method involve?

Explaining the specific steps you took to address the problem or situation.

During PCA, if the original data isn't scaled, what is the implication for the identified principal components?

Variables with larger variances prior to PCA will have a disproportionately larger influence on the principal components.

Suppose you are building an autoencoder for data compression. During evaluation, you notice that the autoencoder struggles with a specific subset of rare inputs in the dataset. What adjustment can you make?

Over-sample the rare inputs within your training data so that your model sees more of those specific rare inputs.

Given a scenario where the bias is high and the variance is low, what course of action would you recommend?

Add more features to the model and decrease regularization.

What modifications could be made to a Loss function?

The addition of an L1 or L2 penalty.

What is the importance of convolutional layers in CNNs?

They detect patterns between spatially related data.

AlexNet and ResNet both sought to improve results; what was a key difference between their approaches?

While ResNet worked to solve the vanishing gradient problem, AlexNet made its improvements via ReLU.

What would be the advantages of bagging?

It decreases the variance in the predictions.

Which statement best describes the benefit of ensembling?

Ensembling can reduce variance or bias.

Which data structure is not appropriate for coding during interviews?

There are no inappropriate data structures for interviews.

What is an appropriate strategy to prepare for the interview?

All of the previous options.

What is the first strategic step in answering a behavioral interview question?

Extract useful keywords that encapsulate the gist of the question.

Think about what the 'Task' component of your stories is useful for, and pick the best option.

Explain your responsibility in the situation.

Flashcards

What is Bias?

Error between average model prediction and ground truth. It tells us the capacity of the underlying model to predict the values.

What is Variance?

Average variability in the model prediction for the given dataset. It tells you how much the function can adjust to changes in the dataset.

High Bias

Occurs when the model is too simple, leading to under-fitting. It also leads to high error on both test and train data

High Variance

Occurs when the model is overly complex, leading to over-fitting. It also leads to low error on train data and high error on test data.


Bias-Variance Trade-off

Increasing bias reduces variance and vice versa. The best model minimizes total error by compromising between bias and variance.


Precision

Correct positive predictions over total positive predictions: TP / (TP + FP).


Recall

Correctly detected positives over total actual positives.


F1 Score

Harmonic mean of Precision and Recall


Data Replication

Addresses class imbalance by replicating minority class data.


Synthetic Data

Creates synthetic data via transformations/noise to balance classes


Modified Loss

Modifies the loss to reflect greater error when misclassifying the smaller sample set, e.g. loss = a·loss_minority + b·loss_majority with a larger weight on the minority-class term.


Change the Algorithm

Increases model complexity so that the two classes become perfectly separable (con: overfitting).


What is PCA?

Finds orthogonal feature vectors maximizing data spread and ranks them by variance.


Steps for PCA

Standardize data, find covariance matrix, eigenvalue decomposition, sort eigenvalues.


Dimensionality Reduction with PCA

Keep top feature vectors by PCA to preserve maximum information.


Bayes' Theorem

Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.


Maximum A Posteriori (MAP) Estimation

The MAP estimate of a random variable y accommodates prior knowledge when estimating: ŷ = argmax_y P(y) ∏ᵢ P(xᵢ|y)


Maximum Likelihood Estimation (MLE)

The MAP estimate of the random variable y, assuming we don't have any prior knowledge of the quantity being estimated: ŷ = argmax_y ∏ᵢ P(xᵢ|y)


Naïve Bayes' Classifier

Applies Bayes' theorem assuming the features are independent of one another, to simplify calculations.


Regression Analysis

Fits a function f(.) to datapoints yᵢ=f(xᵢ) under some error function.


L2 Regularization

Prevents weights from getting too large (as measured by the L2 norm). The larger the weights, the more complex the model and the greater the chance of overfitting; loss = error(y, ŷ) + λ Σ βᵢ²


L1 Regularization

Prevents weights from getting too large (as measured by the L1 norm). The larger the weights, the more complex the model and the greater the chance of overfitting; loss = error(y, ŷ) + λ Σ |βᵢ|


Entropy Regularization

Forces the probability distribution towards the uniform distribution; loss = error(p, p̂) − λ Σ pᵢ log(pᵢ)


Data augmentation

Creating more data from available data by randomly cropping, dilating, rotating, adding small amount of noise etc.


K-fold Cross-validation

Divide the data into k groups. Train on (k-1) groups and test on 1 group. Try all k possible combinations.


Injecting noise

Add random noise to the weights while they are being learned so that the model becomes relatively insensitive to small variations, regularizing it.


Dropout

Used for neural networks; Connections between consecutive layers are randomly dropped based on a dropout-ratio and the remaining network is trained in the current iteration.


Layer function

A layer transforms data as it passes through the CNN; basic transforming functions include convolutional and fully connected layers.


Fully Connected

Linear functions between the input and the output.


Convolutional Layers

Applied to 2D (3D) input feature maps. Trainable weights are a 2D (3D) kernel/filter that moves across the input feature map.


Transposed Convolutional (DeConvolutional) Layer

Usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to 'undo' (not exactly) the convolutional layer.


Max/Average Pooling

A non-trainable layer used to change the size of the feature map by selecting the maximum/average value in the receptive field defined by the kernel.


Normalization

Usually applied just before the activation function to keep unbounded activations from pushing the output layer values too high.


Batch Normalization

A trainable approach to normalizing the data by learning scale and shift variables during training.


Activation

Introduces non-linearity so the CNN can efficiently model complex non-linear mappings. Examples: ReLU, sigmoid.


Loss function

Quantifies how far off the CNN prediction is from the actual labels, i.e. the error in the prediction.


AlexNet - 2012

Consists of 5 Convolutional (CONV) layers and 3 Fully Connected (FC) layers. The activation used is the Rectified Linear Unit (ReLU). Data augmentation is carried out to reduce over-fitting. Uses local response normalization.


VGGNet - 2014

Born out of the need to reduce the number of parameters in the CONV layers and improve training time. There are multiple variants of VGGNet (VGG16, VGG19, etc.); all conv kernels are of size 3x3 and maxpool kernels are of size 2x2 with a stride of two.


ResNet - 2015

Neural networks are notorious for not being able to find a simpler mapping when one exists; the ResNet architecture makes use of shortcut connections to solve the vanishing gradient problem.


Inception (GoogLeNet) - 2014

Consists of several inception modules. Each inception module consists of four operations in parallel: 1x1 conv layer, 3x3 conv layer, 5x5 conv layer, max pooling


Study Notes

  • This document contains cheat sheets for Machine Learning and Data Science topics asked during interviews.

Bias-Variance Tradeoff

  • Bias measures the error between average prediction and ground truth, indicating the model's capacity to predict values.
  • Variance represents the variability in model predictions for a given dataset and how the model adjusts to dataset changes.
  • High bias indicates an overly simplified model (under-fitting), while high variance signifies an overly complex model (over-fitting).
  • Increasing bias can reduce variance and vice-versa.
  • Error is calculated by: Error = bias² + variance + irreducible error.
  • The best model reduces error through a compromise between bias and variance.
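
To make the decomposition concrete, here is a minimal sketch (my own illustration, not from the source notes) that estimates bias² and variance empirically with NumPy: polynomials of different degrees are fit to many independently sampled training sets drawn from an assumed sine-plus-noise ground truth, and both terms are measured on a fixed test grid. Low degrees show high bias and low variance; high degrees show the reverse.

import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth function for the simulation
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_datasets=200, n_points=30, noise=0.3):
    # Train one model per independently sampled dataset and evaluate all of
    # them on the same test points.
    x_test = np.linspace(0, 1, 50)
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_fn(x) + rng.normal(0, noise, n_points)
        coefs = np.polyfit(x, y, degree)          # model of the given complexity
        preds[i] = np.polyval(coefs, x_test)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_fn(x_test)) ** 2)   # bias^2 term
    variance = np.mean(preds.var(axis=0))                   # variance term
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = estimate_bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.3f}, variance={var:.3f}")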

Imbalanced Data in Classification

  • Accuracy, while a common metric, does not always provide correct insights for trained models.
  • Precision measures the exactness of the model and Recall its completeness.
  • The F1 Score combines Precision and Recall.
  • For class 1, Precision = TP / (TP + FP) or True Positives / (True Positives + False Positives)
  • For class 1, Recall (Sensitivity) = TP / (TP + FN) or True Positives / (True Positives + False Negatives)
  • Specificity = TN / (TN + FP) or True Negatives / (True Negatives + False Positives)
  • False Positive Rate = FP / (TN+FP) or False Positives / (True Negatives + False Positives)
  • Accuracy is (TP+TN) / (TP+TN+FP+FN) or (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
  • Data replication, synthetic data generation, modified loss functions, and algorithm adjustments can address imbalanced data.
  • Synthetic Data involves image rotation, dilation, cropping, and noise addition to create new data.
  • Modified loss: adjust the loss function to reflect greater error when misclassifying the smaller sample set.
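
As a quick reference, the sketch below computes these metrics directly from confusion-matrix counts in plain Python; the fraud-detection counts are made up for illustration and are not from the source.

def classification_metrics(tp, fp, tn, fn):
    # Metrics for the positive (minority) class, following the formulas above
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "fpr": false_positive_rate, "accuracy": accuracy, "f1": f1}

# Hypothetical fraud detector on 1,000 transactions with 2% fraud:
# it catches 15 of 20 frauds but raises 30 false alarms.
print(classification_metrics(tp=15, fp=30, tn=950, fn=5))

Note how accuracy stays high (96.5%) even though precision is only one third, which is exactly why accuracy alone is misleading on imbalanced data.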

PCA Dimensionality Reduction

  • PCA finds a new set of orthogonal feature vectors in a dataset to maximize data spread in the feature vector direction.
  • Feature vectors are ranked in decreasing order of data spread (variance).
  • Datapoints show maximum variance in the first feature vector and minimum variance in the last.
  • Variance of datapoints in the feature vector direction indicates information measure.
  • Steps: Standardize datapoints, find the covariance matrix, perform eigenvalue decomposition, and sort.
  • Dimensionality reduction steps: apply the steps above, keep the first m feature vectors from the sorted eigenvector matrix, and transform the data into the new basis; the importance of a feature vector is proportional to the magnitude of its eigenvalue.
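
The steps map almost line-for-line onto NumPy; the sketch below is a minimal illustration on randomly generated correlated data (the data and the choice of m = 2 are assumptions for the example).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # synthetic correlated features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)               # 1. standardize
cov = np.cov(X_std, rowvar=False)                          # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                     # 3. eigenvalue decomposition
order = np.argsort(eigvals)[::-1]                          # 4. sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("explained variance ratio:", np.round(eigvals / eigvals.sum(), 3))

m = 2                                                      # keep the first m components
X_reduced = X_std @ eigvecs[:, :m]                         # project onto the new basis
print("reduced shape:", X_reduced.shape)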

Bayes Theorem and Classifier

  • Bayes' Theorem describes event probability based on prior knowledge of related conditions.
  • Bayes' Theorem is: P(A|B) = (P(B|A) * P(A)) / P(B).
  • MAP Estimation: The MAP estimate of a random variable y, given observed data, involves accommodating prior knowledge during estimation.
  • MLE (Maximum Likelihood Estimation) is a special case of MAP where the prior is uniform.
  • Naive Bayes' assumes the features are independent.
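
A small worked example of the MLE-versus-MAP distinction (the numbers and the Beta prior are assumptions chosen for illustration): estimating the probability of a rare event from a handful of observations, with and without prior knowledge.

import numpy as np

data = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])       # 1 = event occurred
k, n = data.sum(), data.size

# MLE: maximize P(data | p) -> the sample mean
p_mle = k / n

# MAP with a Beta(a, b) prior: maximize P(p | data) ∝ P(data | p) · P(p)
a, b = 2, 20                                           # prior belief that the event is rare
p_map = (k + a - 1) / (n + a + b - 2)                  # mode of the Beta posterior

print(f"MLE estimate: {p_mle:.3f}")                    # 0.200 — data only
print(f"MAP estimate: {p_map:.3f}")                    # 0.100 — pulled towards the prior

With a uniform prior (a = b = 1) the MAP estimate reduces to the MLE, matching the note above.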

Regression Analysis

  • Regression analysis involves fitting a function f(.) to data points yᵢ = f(xᵢ) under an error function.
  • Linear Regression fits a line minimizing the sum of mean-squared error for each data point.
  • Polynomial Regression fits a polynomial of order k minimizing the sum of mean-squared error.
  • Bayesian Regression fits a Gaussian distribution by minimizing the mean-squared error for each data point.
  • Ridge Regression fits a line or polynomial minimizing the sum of mean-squared error and the weighted L2 norm of the function parameters.
  • LASSO Regression fits a line or polynomial minimizing the mean-squared error and the weighted L1 norm.
  • Logistic Regression fits a line or polynomial with sigmoid activation, minimizing binary cross-entropy loss.
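
A compact way to see these variants side by side is scikit-learn's linear_model module (using scikit-learn here is my own choice; the synthetic data, with only two informative features, is an assumption for the example):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

print(LinearRegression().fit(X, y).coef_)   # plain least squares
print(Ridge(alpha=1.0).fit(X, y).coef_)     # + weighted L2 norm: coefficients shrink
print(Lasso(alpha=0.1).fit(X, y).coef_)     # + weighted L1 norm: some coefficients become 0

# Logistic regression: sigmoid of a linear combination gives a probability in (0, 1)
y_bin = (y > 0).astype(int)
clf = LogisticRegression().fit(X, y_bin)
print(clf.predict_proba(X[:3]))             # fitting minimizes binary cross-entropy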

Regularization in ML

  • Regularization addresses over-fitting in ML, reducing model variance.
  • L2 Regularization prevents weights from becoming too large. Larger weights increase model complexity and chances of overfitting.
  • L1 Regularization prevents weights from becoming too large (defined by L1 norm) and introduces sparsity.
  • Entropy regularization is used for probability output models, pushing distribution towards uniformity.
  • Data augmentation creates more data from the available data (random cropping, rotation, added noise, etc.).
  • K-fold Cross-validation divides the data into k groups, training on (k-1) and testing on 1.
  • Injecting noise: adding random noise to the weights during learning.
  • Dropout involves randomly dropping connections between consecutive layers in neural networks.
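
Of these techniques, K-fold cross-validation is the easiest to sketch end to end; the example below (scikit-learn and the synthetic data are assumptions for illustration) trains an L2-regularized model on k−1 folds and scores it on the held-out fold, for all k splits.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])   # L2-regularized fit
    scores.append(model.score(X[test_idx], y[test_idx]))       # R^2 on the held-out fold
print("per-fold R^2:", np.round(scores, 3))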

Convolutional Neural Network

  • Data enters CNN via the input layer, passing through hidden layers to the output, and backpropagation updates the weights.
  • CNN layers commonly include convolutional, pooling, normalization, activation, and loss calculation.
  • Convolutional layers apply filters across feature maps generating dot products.
  • Transposed Convolutional Layers increase output feature map size (Up-sampling).
  • Pooling layers are non-trainable and change feature map size.
  • Max/Average Pooling decreases spatial size, selecting max/average values defined by the kernel.
  • UnPooling increases spatial size by placing input pixels in defined receptive fields.
  • Various normalization approaches limit unbounded activation.
  • Regression and classification losses are the two main types of loss functions.
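
The layer types above compose naturally into a small network; the sketch below uses PyTorch (a framework choice of mine, not named in the notes) and assumed 28x28 grayscale inputs, purely to show where convolution, normalization, activation, pooling, the fully connected layer, and the loss fit.

import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.BatchNorm2d(16),                           # normalization
    nn.ReLU(),                                    # activation (non-linearity)
    nn.MaxPool2d(2),                              # pooling: halves the spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # fully connected output layer
)

x = torch.randn(8, 1, 28, 28)                     # a batch of 8 images
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))   # classification loss
loss.backward()                                   # gradients for the weight update
print(logits.shape, loss.item())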

Famous CNNs

  • AlexNet (2012): Consists of 5 CONV layers and 3 FC layers using ReLU activation and local response normalization.
  • VGGNet (2014): Reduces parameters in CONV layers and improves training time using small conv kernels (3x3) and maxpool kernels (2x2).
  • ResNet (2015): Uses shortcut connections between layers to solve the vanishing gradient problem, with multiple versions (ResNet50, ResNet101).
  • Inception (2014): Uses larger kernels to capture more global features and smaller kernels for area-specific features, so kernels of different sizes are needed in parallel.
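
Since the shortcut connection is ResNet's defining idea, here is a hedged sketch of a single residual block (PyTorch again; the channel count and input size are arbitrary choices for the example):

import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # shortcut: add the input back, easing gradient flow

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)   # the identity mapping is easy to learn: drive the convs to zero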

Ensemble Learning in ML

  • Ensemble Learning combines multiple weak models to improve bias, variance, and/or accuracy.
  • Bagging trains N weak models on random subsets of the data in parallel. It reduces the variance of the prediction.
  • Boosting trains N weak models sequentially, weighting misclassified points and decreasing bias.
  • Stacking trains N models of different types on one subset, using a meta-learner for the final prediction; it improves accuracy.
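
In scikit-learn (my choice of library for illustration; the dataset is synthetic) the three strategies correspond to ready-made wrappers:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)   # parallel, reduces variance
boosting = AdaBoostClassifier(n_estimators=50)                           # sequential, reduces bias
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),                                # meta-learner
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))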

Autoencoder & Variational Autoencoder

  • An autoencoder learns to find efficient embeddings of unlabeled data, using an encoder and a decoder.
  • Autoencoders compress data from a higher to a lower dimension and back, and are trained on a reconstruction loss.
  • A VAE (variational autoencoder) addresses the non-regularized latent space by outputting the parameters of a pre-defined distribution and imposing a constraint that forces it towards a normal distribution.
  • The latent space is smooth and continuous, and the training loss is the sum of the reconstruction loss and the KL divergence between the unit Gaussian and the encoder's output distribution.
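
A minimal autoencoder sketch (PyTorch assumed, layer sizes arbitrary) showing the encoder/decoder split and the reconstruction loss; the final comment notes what a VAE would add.

import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))   # 784 -> 16
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))   # 16 -> 784

x = torch.rand(32, 784)                        # a batch of flattened inputs
z = encoder(x)                                 # low-dimensional latent embedding
x_hat = decoder(z)                             # reconstruction
recon_loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss
recon_loss.backward()
print(z.shape, recon_loss.item())

# A VAE would instead have the encoder output a mean and a variance, sample z from
# that distribution, and add a KL-divergence term against a unit Gaussian to the loss.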

Data Structures

  • List: Ordered collection of elements accessed by index, in any order.
  • Linked List: Elements with values and pointers, traversed sequentially.
  • Stack: A sequential data structure accessed in LIFO order.
  • Queue: A sequential data structure accessed in FIFO order.
  • HashTable: Paired assignments accessed in constant time.
  • Tree: Hierarchical relation between root, parent, child, and leaf nodes.
  • Graph: A pair of sets (V, E) of vertices and edges, which can be cyclic.
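
In Python (my choice of language for the examples) the core structures map onto built-ins; a short sketch of the access patterns described above:

from collections import deque

nums = [10, 20, 30]              # List: contiguous storage, O(1) random access by index
print(nums[1])                   # 20

stack = []                       # Stack: LIFO via append / pop
stack.append("a"); stack.append("b")
print(stack.pop())               # "b" — last in, first out

queue = deque()                  # Queue: FIFO via append / popleft
queue.append("a"); queue.append("b")
print(queue.popleft())           # "a" — first in, first out

table = {"k1": 1, "k2": 2}       # Hash table: average O(1) lookup by key
print(table["k2"])               # 2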

Coding Interview Preparation

  • Timeline: start preparing 2-3 months in advance; it gets easier after some experience.
  • Review: Lists/Arrays, Linked List, Hash Table/Dictionary, Tree, Graph, Heap, Queue.
  • Practice: on LeetCode.com, InterviewBit.com, HackerRank.com
  • Listen, Talk, Discuss, Start Coding, Discuss, Optimize and Repeat

Behavioral Interview Preparation

  • The STAR method can be used to organize the stories one tells during an interview.
  • Situation: provide the necessary context.
  • Task: explain what you were responsible for.
  • Action: describe the specific steps you took to address the issue.
  • Result: state the outcome of your actions.
