Questions and Answers
Consider a classification problem with imbalanced data. A classifier that always predicts the majority class achieves 90% accuracy. Which of the following statements is most accurate?
- The classifier is an acceptable starting point and should be further optimized for better performance.
- The classifier is not useful because it does not provide any insight into the minority class. (correct)
- The classifier is performing well, as it achieves high accuracy.
- The classifier is overfitting to the majority class.
Assume you are building a model to predict fraudulent transactions. The dataset is highly imbalanced, with only 2% of transactions being fraudulent. Which evaluation metric is the most appropriate to use?
- F1 Score (correct)
- Precision
- Accuracy
- Recall
How does increasing model complexity typically affect bias and variance?
- Increases both bias and variance
- Decreases both bias and variance
- Decreases bias and increases variance (correct)
- Increases bias and decreases variance
You're using PCA to reduce the dimensionality of your dataset. You notice that the first two principal components explain 95% of the variance. Which of the following is the most reasonable conclusion?
In PCA, what is the significance of the eigenvectors derived from the covariance matrix of the data?
Consider a dataset where you want to predict whether a customer will click on an ad (binary classification). You have a Naive Bayes' classifier. Under what condition is the Naive Bayes' assumption most likely to be problematic?
In the context of Bayesian learning, differentiate between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.
Which statement best describes the role of the sigmoid function in logistic regression?
How does L1 regularization differ from L2 regularization in the context of linear regression?
What is the primary purpose of dropout in neural networks?
In the context of Convolutional Neural Networks (CNNs), what is the role of a pooling layer?
A CNN has multiple convolutional layers. How does the complexity of features detected typically change from earlier to later layers?
In the context of ensemble learning, what is the primary difference between bagging and boosting?
What is the role of a meta-learner in a stacking ensemble method?
How does the encoder in an autoencoder contribute to data compression?
What is the primary difference between an autoencoder (AE) and a variational autoencoder (VAE)?
In data structures, what is the key distinction between a List and a Linked List regarding memory usage and access?
What is the difference between a Stack and a Queue in terms of accessing elements?
In a hash table, what is the main reason for collisions, and how are collisions typically resolved?
In tree data structures, how does a 'leaf node' differ from a 'root node'?
How do graphs differ from trees in data structure characteristics?
When preparing for coding interviews, why is it recommended to focus on easy to medium difficulty questions first?
During a coding interview, what is the benefit of 'thinking out loud' while coding?
During a behavioral interview, what is the STAR method primarily used for?
When preparing stories for a behavioral interview, why is it recommended to assign keywords to each story?
What does the 'Action' component of the STAR method involve?
During PCA, if the original data isn't scaled, what is the implication for the identified principal components?
Suppose that you are building an autoencoder for data compression. During evaluation, you notice that the autoencoder struggles with a specific subset of rare inputs in the dataset. What adjustments can you perform?
Given a scenario where the bias is high and the variance is low, what course of action would you recommend?
What modifications could be made to a Loss function?
What is the importance of convolutional layers in CNNs?
AlexNet and ResNet both sought to improve results; what was a key difference between the approaches each of them took?
What would be the advantages of bagging?
Which statement best describes the benefit of ensembling?
Which data structure is not appropriate for coding during interviews?
What is an appropriate strategy to prepare for the interview?
What is the first strategic step in answering a behavioral interview question?
Think about what the Task component of your stories is useful for, and pick the best option.
Flashcards
What is Bias?
Error between average model prediction and ground truth. It tells us the capacity of the underlying model to predict the values.
What is Variance?
Average variability in the model prediction for the given dataset. It tells you how much the function can adjust to the change in the dataset
High Bias
Occurs when the model is too simple, leading to under-fitting. It also leads to high error on both test and train data
High Variance
Occurs when the model is overly complex, leading to over-fitting. It also leads to Low error on train data and high on test
Bias variance Trade-off
Increasing bias reduces variance and vice-versa. The best model is where the error is reduced. You must compromise between bias and variance
Precision
Correctly predicted positives over total predicted positives.
Recall
Correctly detected positives over total actual positives.
F1 Score
Harmonic mean of Precision and Recall
Data Replication
Addresses class imbalance by replicating minority class data.
Synthetic Data
Creates synthetic data via transformations/noise to balance classes
Modified Loss
Modifies the loss to reflect greater error when misclassifying the smaller sample set; loss = a·loss_green + b·loss_blue
Change the Algorithm
Increasing model complexity so that the two classes are perfectly separable (con: overfitting)
What is PCA?
Finds orthogonal feature vectors maximizing data spread, rates them by variance.
Steps for PCA
Standardize data, find covariance matrix, eigenvalue decomposition, sort eigenvalues.
Dimensionality Reduction with PCA
Keep top feature vectors by PCA to preserve maximum information.
Bayes' Theorem
Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
Maximum A Posteriori (MAP) Estimation
The MAP estimate of a random variable y accommodates prior knowledge when estimating: ŷ = argmax_y P(y) ∏ᵢ P(xᵢ|y)
Maximum Likelihood Estimation (MLE)
The MAP estimate of the random variable y, assuming we don't have any prior knowledge of the quantity being estimated: ŷ = argmax_y ∏ᵢ P(xᵢ|y)
Naïve Bayes' Classifier
Applies Bayes' theorem while assuming the features are i.i.d., which simplifies the calculations.
Regression Analysis
Fits a function f(.) to datapoints yᵢ=f(xᵢ) under some error function.
L2 Regularization
Prevents weights from getting too large (defined by the L2 norm). The larger the weights, the more complex the model and the higher the chance of overfitting; loss = error(y, ŷ) + λ Σ β²
L1 Regularization
Prevents weights from getting too large (defined by the L1 norm). The larger the weights, the more complex the model and the higher the chance of overfitting; loss = error(y, ŷ) + λ Σ |β|
Entropy Regularization
Forces the probability distribution towards the uniform distribution; loss = error(p, p̂) − λ Σᵢ pᵢ log(pᵢ)
Data augmentation
Creating more data from available data by randomly cropping, dilating, rotating, adding small amount of noise etc.
K-fold Cross-validation
Divide the data into k groups. Train on (k-1) groups and test on 1 group. Try all k possible combinations.
Injecting noise
Add random noise to the weights when they are being learned to be relatively insensitive to small variations, regularizing the model.
Dropout
Used for neural networks; Connections between consecutive layers are randomly dropped based on a dropout-ratio and the remaining network is trained in the current iteration.
Layer function
A basic transformation applied to the data as it flows through the CNN, such as a convolutional or fully connected layer.
Fully Connected
Linear functions between the input and the output units.
Convolutional Layers
Applied to 2D (3D) input feature maps. Trainable weights are a 2D (3D) kernel/filter that moves across the input feature map.
Transposed Convolutional (DeConvolutional) Layer
Usually used to increase the size of the output feature map (upsampling). The idea behind the transposed convolutional layer is to 'undo' (though not exactly) the convolutional layer.
Max/Average Pooling
A non-trainable layer used to change the size of the feature map based on selecting the maximum/average value in receptive field defined by the kernel.
Normalization
Usually used just before the activation functions to keep unbounded activations from driving the output layer values too high.
Batch Normalization
A trainable approach to normalizing the data by learning scale and shift variable during training.
Activation
Introduces non-linearity so the CNN can efficiently model complex non-linear mappings. Examples: linear, ReLU.
Loss function
Quantifies how far off the CNN prediction is from the actual labels, i.e., the error in the prediction.
AlexNet - 2012
Consists of 5 Convolutional (CONV) layers and 3 Fully Connected (FC) layers. The activation used is the Rectified Linear Unit (ReLU). Data augmentation is carried out to reduce over-fitting. Uses Local Response Normalization.
VGGNet - 2014
Born out of the need to reduce the number of parameters in the CONV layers and improve training time. There are multiple variants of VGGNet (VGG16, VGG19, etc.); all the conv kernels are of size 3x3 and the maxpool kernels are of size 2x2 with a stride of two.
ResNet - 2015
Neural networks are notorious for not being able to find a simpler mapping when one exists, and ResNet addresses this. The ResNet architecture makes use of shortcut connections to solve the vanishing gradient problem.
Inception (GoogLeNet) - 2014
Consists of several inception modules. Each inception module consists of four operations in parallel: 1x1 conv layer, 3x3 conv layer, 5x5 conv layer, max pooling
Study Notes
- This document contains cheat sheets for Machine Learning and Data Science topics asked during interviews.
Bias-Variance Tradeoff
- Bias measures the error between average prediction and ground truth, indicating the model's capacity to predict values.
- Variance represents the variability in model predictions for a given dataset and how the model adjusts to dataset changes.
- High bias indicates an overly simplified model (under-fitting), while high variance signifies an overly complex model (over-fitting).
- Increasing bias can reduce variance and vice-versa.
- Error is calculated by: Error = bias² + variance + irreducible error.
- The best model reduces error through a compromise between bias and variance.
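To make the trade-off concrete, here is a minimal numpy sketch (the synthetic data and polynomial degrees are illustrative assumptions, not from the notes): an underfit low-degree model shows high error on both splits (high bias), while an overfit high-degree model shows low train error but higher test error (high variance).

```python
# Illustrative bias-variance sketch with synthetic data (assumed example).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy ground truth

x_train, y_train = x[::2], y[::2]   # even indices for training
x_test, y_test = x[1::2], y[1::2]   # odd indices for testing

for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)         # fit a polynomial model
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
# Expected pattern: degree 1 -> high error on both sets (high bias, under-fitting);
# degree 15 -> low train error but higher test error (high variance, over-fitting).
```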
Imbalanced Data in Classification
- Accuracy, while a common metric, does not always provide correct insights for trained models.
- Precision measures the exactness of the model and Recall its completeness.
- The F1 Score combines Precision and Recall.
- In class 1, Precision = TP / (TP + FP) or True Positives / (True Positives + False Positives)
- In class 1, Recall (Sensitivity) = TP / (TP + FN) or True Positives / (True Positives + False Negatives)
- Specificity = TN / (TN + FP) or True Negatives / (True Negatives + False Positives)
- False Positive Rate = FP / (TN+FP) or False Positives / (True Negatives + False Positives)
- Accuracy is (TP+TN) / (TP+TN+FP+FN) or (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
- Data replication, synthetic data generation, modified loss functions, and algorithm adjustments can address imbalanced data.
- Synthetic Data involves image rotation, dilation, cropping, and noise addition to create new data.
- A modified loss reflects greater error when misclassifying the smaller sample set.
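The metrics above can be computed directly from confusion-matrix counts. The sketch below uses hypothetical counts for an imbalanced dataset to show why accuracy can look good while recall on the minority class stays poor.

```python
# Minimal sketch of the metrics defined above, using assumed example counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # exactness
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # completeness (sensitivity)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}

# Imbalanced example: 980 negatives, 20 positives; the model finds half the positives.
print(classification_metrics(tp=10, tn=975, fp=5, fn=10))
# Accuracy is ~0.985 even though recall on the minority class is only 0.5,
# which is why F1 (harmonic mean of precision and recall) is the better summary here.
```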
PCA Dimensionality Reduction
- PCA finds a new set of orthogonal feature vectors in a dataset to maximize data spread in the feature vector direction.
- Feature vectors are ranked in decreasing order of data spread (variance).
- Datapoints show maximum variance in the first feature vector and minimum variance in the last.
- Variance of datapoints in the feature vector direction indicates information measure.
- Steps: Standardize datapoints, find the covariance matrix, perform eigenvalue decomposition, and sort.
- Dimensionality reduction steps: apply the steps above, keep the first m feature vectors from the sorted eigenvector matrix, and transform the data to the new basis; the importance of a feature vector is proportional to the magnitude of its eigenvalue.
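A minimal numpy sketch of these PCA steps (the function and variable names are illustrative assumptions):

```python
# Standardize, compute covariance, eigen-decompose, sort, and project.
import numpy as np

def pca(X: np.ndarray, m: int) -> np.ndarray:
    # 1. Standardize the data (zero mean, unit variance per feature).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalue decomposition (symmetric matrix -> eigh).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by decreasing eigenvalue (variance explained).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    print("variance explained per component:", np.round(eigvals / eigvals.sum(), 3))
    # 5. Keep the first m feature vectors and project onto the new basis.
    return X_std @ eigvecs[:, :m]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca(X, m=2)   # 100 x 2 projection onto the top two components
```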
Bayes Theorem and Classifier
- Bayes' Theorem describes event probability based on prior knowledge of related conditions.
- Bayes' Theorem is: P(A|B) = (P(B|A) * P(A)) / P(B).
- MAP Estimation: The MAP estimate of a random variable y, given observed data, involves accommodating prior knowledge during estimation.
- MLE (Maximum Likelihood Estimation) is a special case of MAP where the prior is uniform.
- Naive Bayes' assumes the features are independent.
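A small worked example of the MLE/MAP distinction, using hypothetical coin-flip data and a Beta prior (the closed-form MAP expression below assumes that prior):

```python
# MLE maximizes the likelihood alone; MAP additionally weights in a prior.
heads, tails = 7, 3                # assumed observed data
a, b = 2, 2                        # Beta prior pseudo-counts (prior belief: fair coin)

theta_mle = heads / (heads + tails)                        # argmax of the likelihood
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # argmax of the posterior

print(f"MLE estimate: {theta_mle:.3f}")   # 0.700 -- data only
print(f"MAP estimate: {theta_map:.3f}")   # 0.667 -- pulled toward the prior mean 0.5
# With a uniform prior (a = b = 1) the MAP estimate reduces to the MLE,
# matching the note that MLE is MAP with a uniform prior.
```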
Regression Analysis
- Regression analysis involves fitting a function f(·) to data points yᵢ = f(xᵢ) under an error function.
- Linear Regression fits a line minimizing the sum of mean-squared error for each data point.
- Polynomial Regression fits a polynomial of order k minimizing the sum of mean-squared error.
- Bayesian Regression fits a Gaussian distribution by minimizing the mean-squared error for each data point.
- Ridge Regression fits a line or polynomial minimizing the sum of mean-squared error and the weighted L2 norm of the function parameters.
- LASSO Regression fits a line or polynomial minimizing the mean-squared error and the weighted L1 norm.
- Logistic Regression fits a line or polynomial with sigmoid activation, minimizing binary cross-entropy loss.
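A short scikit-learn sketch (synthetic data and regularization strengths are illustrative assumptions) contrasting ordinary least squares with the Ridge (L2) and LASSO (L1) variants; the sparsity effect of L1 shows up as coefficients driven exactly to zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative features
y = X @ true_w + rng.normal(0, 0.5, size=200)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("LASSO (L1)", Lasso(alpha=0.1))]:
    model.fit(X, y)
    n_zero = int(np.sum(np.abs(model.coef_) < 1e-3))
    print(f"{name:12s} near-zero coefficients: {n_zero}/10")
# Expected: LASSO pushes most uninformative coefficients exactly to zero (sparsity),
# while Ridge only shrinks them, matching the L1 vs L2 distinction above.
```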
Regularization in ML
- Regularization addresses over-fitting in ML, reducing model variance.
- L2 Regularization prevents weights from becoming too large. Larger weights increase model complexity and chances of overfitting.
- L1 Regularization prevents weights from becoming too large (defined by L1 norm) and introduces sparsity.
- Entropy regularization is used for probability output models, pushing distribution towards uniformity.
- Data augmentation creates more data from the available data (e.g., cropping, rotating, adding small amounts of noise).
- K-fold Cross-validation divides the data into k groups, training on (k-1) and testing on 1.
- Injecting noise, adding random noise to the weights during learning.
- Dropout involves randomly dropping connections between consecutive layers in neural networks.
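A minimal numpy sketch of k-fold cross-validation as described above; the closed-form L2-regularized linear regression used inside the loop is an illustrative choice, not part of the notes:

```python
import numpy as np

def kfold_mse(X, y, k=5, lam=1.0):
    n = len(y)
    idx = np.random.default_rng(0).permutation(n)
    folds = np.array_split(idx, k)          # divide the data into k groups
    errors = []
    for i in range(k):
        test = folds[i]                      # test on 1 group
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # train on k-1
        # Closed-form L2-regularized linear regression on the training folds.
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[train].T @ y[train])
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))            # average error over all k combinations

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.1, size=120)
print("5-fold CV mean squared error:", round(kfold_mse(X, y), 4))
```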
Convolutional Neural Network
- Data enters CNN via the input layer, passing through hidden layers to the output, and backpropagation updates the weights.
- CNN layers commonly include convolutional, pooling, normalization, activation, and loss calculation.
- Convolutional layers apply filters across feature maps generating dot products.
- Transposed Convolutional Layers increase output feature map size (Up-sampling).
- Pooling layers are non-trainable and change feature map size.
- Max/Average Pooling decreases spatial size, selecting max/average values defined by the kernel.
- UnPooling increases spatial size by placing input pixels in defined receptive fields.
- Various normalization approaches limit unbounded activation.
- Regression and classification losses are the two main types of loss functions.
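A minimal PyTorch sketch (layer sizes are hypothetical) wiring the listed layer types together for 28x28 single-channel inputs:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.BatchNorm2d(16),                          # normalization
            nn.ReLU(),                                   # activation (non-linearity)
            nn.MaxPool2d(2),                             # pooling: 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = TinyCNN()
logits = model(torch.randn(8, 1, 28, 28))                # a batch of 8 images
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))  # classification loss
loss.backward()                                          # backpropagation of the error
```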
Famous CNNs
- AlexNet (2012): Consists of 5 CONV layers and 3 FC layers using ReLU activation and Local Response Normalization.
- VGGNet (2014): Reduces parameters in CONV layers and improves training time using small conv kernels (3x3) and maxpool kernels (2x2).
- ResNet (2015): Uses shortcut connections between layers to solve the vanishing gradient problem, with multiple versions (ResNet50, ResNet101).
- Inception (2014): Uses larger kernels for more global features and smaller kernels for area-specific features, so kernels of different sizes are needed (applied in parallel within each inception module).
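A simplified sketch of the ResNet idea, assuming an identity shortcut and equal channel counts; the block outputs F(x) + x, so gradients can flow directly through the shortcut:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut connection: add the input back

block = ResidualBlock(16)
y = block(torch.randn(2, 16, 32, 32))   # same shape in and out: (2, 16, 32, 32)
```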
Ensemble Learning in ML
- Ensemble Learning combines multiple weak models to improve bias, variance, and/or accuracy.
- Bagging trains N weak models with subsets in parallel. It reduces the variance in the prediction.
- Boosting trains N weak models sequentially, weighting misclassified points and decreasing bias.
- Stacking trains N models of different types on one subset, using a meta-learner for the final prediction, and improves accuracy.
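A short scikit-learn sketch (synthetic data and estimator counts are illustrative assumptions) of the three ensemble styles: bagging in parallel, boosting sequentially, and stacking with a meta-learner on top of heterogeneous base models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)   # reduces variance
boosting = AdaBoostClassifier(n_estimators=25)                           # reduces bias
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression())                                # meta-learner

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```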
Autoencoder & Variational Autoencoder
- Autoencoders learn to find efficient embeddings of unlabeled data, using an encoder and a decoder.
- Autoencoders compress data from a higher to a lower dimension and back, and are trained with a reconstruction loss.
- A VAE (variational autoencoder) addresses the non-regularized latent space of an AE: the encoder outputs the parameters of a pre-defined distribution, and a constraint forces that distribution to be a normal distribution.
- The latent space is smooth and continuous; the training loss is the sum of the reconstruction loss and the KL divergence between the encoder's output distribution and a unit Gaussian.
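A minimal PyTorch autoencoder sketch (the layer dimensions are hypothetical); the final comment notes how a VAE would differ:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))       # compression
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))        # reconstruction

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional embedding
        return self.decoder(z)

model = AutoEncoder()
x = torch.rand(16, 784)                # a batch of flattened inputs
recon = model(x)
loss = nn.MSELoss()(recon, x)          # reconstruction loss
loss.backward()
# A VAE would instead have the encoder output a mean and log-variance, sample z
# from that Gaussian, and add a KL-divergence term to this loss.
```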
Data Structures
- List: Ordered collection of elements accessed by index, in any order.
- Linked List: Elements with values and pointers, traversed sequentially.
- Stack: A sequential data structure accessed in LIFO order.
- Queue: A sequential data structure accessed in FIFO order.
- HashTable: Paired assignments accessed in constant time.
- Tree: Hierarchical relation between root, parent, child, and leaf nodes.
- Graph: A pair of sets (V, E) of vertices and edges, which can be cyclic.
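A small Python sketch mapping these structures to standard-library types (the example values are arbitrary):

```python
from collections import deque

items = [10, 20, 30]
print(items[1])                 # List: O(1) access by index -> 20

stack = []
stack.append("a"); stack.append("b")
print(stack.pop())              # Stack: LIFO order -> "b"

queue = deque()
queue.append("a"); queue.append("b")
print(queue.popleft())          # Queue: FIFO order -> "a"

table = {"alice": 3, "bob": 7}  # Hash table: keys are hashed to buckets;
table["alice"] = 4              # CPython resolves collisions with open addressing
print(table["alice"])           # average O(1) lookup -> 4
```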
Coding Interview Preparation
- Timeline: start preparing 2-3 months in advance; it gets easier after some experience.
- Review: Lists/Arrays, Linked List, Hash Table/Dictionary, Tree, Graph, Heap, Queue.
- Practice: on LeetCode.com, InterviewBit.com, HackerRank.com
- Listen, Talk, Discuss, Start Coding, Discuss, Optimize and Repeat
Behavioral Interview Preparation
- The STAR method can be used to organize the stories one tells during an interview.
- Situation: provide the necessary context.
- Task: explain what you were responsible for.
- Action: provide the steps you took to address the issue.
- Result: state the outcome of your actions.