Questions and Answers
Which of the following actions is most crucial in ensuring data integrity when transferring files between systems?
- Encrypting the file to protect against unauthorized access.
- Implementing checksums or hash values to verify data. (correct)
- Using the fastest available network connection.
- Compressing the file to reduce transfer time.
In the context of data management, what is the primary benefit of using version control systems?
- Automating data backups to prevent data loss.
- Encrypting sensitive data to protect it from unauthorized access.
- Compressing data to save storage space.
- Tracking and managing changes to data over time. (correct)
What is the most important consideration when selecting a data storage solution for long-term archival purposes?
- The durability and longevity of the storage medium. (correct)
- The initial cost of the storage solution.
- The scalability of the storage solution.
- The speed of data retrieval.
Which of the following techniques is most effective for preventing SQL injection attacks?
What is the benefit of using data validation techniques?
What is the primary purpose of data normalization in database design?
Which of the following methods is most suitable for securely disposing of a hard drive containing sensitive data?
What role does metadata play in data governance?
Flashcards
What is an Array?
A data structure where elements are arranged sequentially, each identified by an index.
What is Sorting?
The process of arranging elements in a specific order, often numerically or alphabetically.
What is Searching?
A method for locating a specific element within a data structure.
What is a Graph?
A data structure consisting of nodes (vertices) connected by edges, used to represent relationships between entities.
What is a Chart?
A visual representation of data, such as a bar, line, or pie chart.
What is Information?
Data that has been processed, organized, or given context so that it becomes meaningful.
What is a Database?
An organized collection of structured data that can be stored, queried, and managed electronically.
What is a Loop?
A control structure that repeats a block of instructions until a specified condition is met.
Study Notes
Loss Function
- Measures the error between predicted and actual values in a model
- The goal is to minimize the error by optimizing model parameters
Likelihood and Probability
- Likelihood measures how likely a specific set of parameters explains the observed data
- Probability measures the likelihood of observing a certain outcome given fixed parameters
- Probability deals with data given fixed parameters
- Likelihood measures the "fit" of parameters given the observed data
- Equation for likelihood: L(θ|X) = Πᵢ P(xᵢ|θ), the product over i = 1, ..., n, for data points X = {x₁, x₂, ..., xₙ} and parameter θ
Maximum Likelihood Estimation (MLE)
- Purpose is to estimate the parameter θ that maximizes the likelihood of observing the given data
- θMLE = arg max L(θ|X) over θ
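A minimal sketch of MLE (the coin-flip data below is simulated, not from the course): evaluating the likelihood on a grid of candidate θ values and taking the arg max lands near the closed-form answer, the sample mean.

```python
import numpy as np

# Simulated coin flips (assumed example data); true parameter is 0.7
rng = np.random.default_rng(0)
X = rng.binomial(n=1, p=0.7, size=100)

# Evaluate L(theta|X) = product of P(x_i|theta) on a grid of candidate thetas
thetas = np.linspace(0.01, 0.99, 99)
likelihood = np.array([np.prod(np.where(X == 1, t, 1 - t)) for t in thetas])

theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle, X.mean())  # grid arg max is close to the sample mean (closed-form MLE)
```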
Negative Log-Likelihood (NLL)
- Transforms the product of probabilities (likelihood) into a sum for easier optimization.
- Minimize NLL instead of maximizing the likelihood
- Commonly used as a loss function in classification models, especially in probabilistic models
- Minimizing the NLL is equivalent to maximizing the likelihood as min(-log L(θ|X)) is equivalent to max L(θ|X)
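A minimal sketch of the same idea, assuming the simulated coin-flip data from the MLE sketch above: summing log-probabilities gives the same optimum as multiplying probabilities, without risking underflow.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(n=1, p=0.7, size=100)   # same simulated coin flips as above
thetas = np.linspace(0.01, 0.99, 99)

# NLL(theta) = -sum_i log P(x_i|theta): a sum of logs instead of a product of probabilities
nll = np.array([-np.sum(np.log(np.where(X == 1, t, 1 - t))) for t in thetas])

# Minimizing the NLL picks the same theta as maximizing the likelihood
print(thetas[np.argmin(nll)])
```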
Maximum A Posteriori (MAP)
- Purpose is to estimate the parameter θ that maximizes the posterior probability by incorporating prior knowledge
- Bayes' Rule defines the posterior: P(θ|X) = P(X|θ)P(θ) / P(X)
- Posterior = likelihood × prior / evidence
- The posterior is used when incorporating prior knowledge about a parameter
- Solve for θMAP = arg max P(θ|X) over θ to find the parameter using the posterior
- A prior is useful with a small sample size, when there is real background knowledge not included in the data, or when the prior can serve as a regularizer (e.g., Lasso and Ridge regularization)
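A minimal sketch of MAP versus MLE for a Bernoulli parameter, using an assumed Beta(5, 5) prior as pseudo-counts: with only five observations the prior pulls the estimate toward 0.5, acting like a regularizer.

```python
import numpy as np

X = np.array([1, 1, 1, 1, 0])   # tiny made-up sample
heads, n = X.sum(), len(X)
a, b = 5, 5                     # Beta prior pseudo-counts (assumed for illustration)

theta_mle = heads / n                          # 0.8, driven only by the data
theta_map = (heads + a - 1) / (n + a + b - 2)  # ~0.62, pulled toward the prior mean 0.5
print(theta_mle, theta_map)
```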
Machine Learning Workflow
- Consists of having data, assuming the model has parameters, and optimizing a loss function to fit the model to the data.
Loss Functions
- Negative Log-Likelihood (NLL) is used in probabilistic models or classification tasks and minimizes the difference between predicted and actual probability distributions
- Sum or Mean Absolute Error (MAE) penalizes deviations between actual and predicted values linearly rather than quadratically, making it more robust to outliers than Mean Squared Error (MSE)
- Lasso (L1 Loss/Regularization) is used to promote sparsity in a model to force some coefficients to zero.
- Lasso is ideal for feature selection
- Ridge (L2 Loss/Regularization) penalizes large model coefficients without forcing them to zero and is used to prevent overfitting in regression models
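A minimal sketch with made-up numbers showing why MAE reacts less to an outlier than MSE, and how the NLL (log loss) scores a classifier's predicted probabilities.

```python
import numpy as np

# Regression losses: the last prediction is a deliberate outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 10.0])
mse = np.mean((y_true - y_pred) ** 2)    # inflated quadratically by the outlier
mae = np.mean(np.abs(y_true - y_pred))   # grows only linearly with the outlier
print(mse, mae)

# Classification loss: NLL of predicted probabilities for binary labels
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])
nll = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(nll)
```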
Other Things to Study
- Review HW1 and examples from the end of Notebook 2 for reinforcement
Naive Bayes: Parameters, Features, and Labels
- Parameters are the underlying values the model learns (e.g., mean and variance in Gaussian Naive Bayes)
- Features are the input variables used for prediction
- Discrete Labels are the possible categories the model predicts (e.g., spam or not spam)
Starting Point of Naive Bayes Classifiers
- Naive Bayes classifiers begin with Bayes' Rule: P(Y|X) = P(X|Y)P(Y) / P(X)
- It conditions on the predictor, meaning it calculates the probability of the class given the observed features
Categorical Naive Bayes
- The main assumption is features are conditionally independent given the class label
- This assumption simplifies learning and reduces computational complexity by allowing computation of probabilities independently
- Steps include calculating the base rates (priors) P(Y), then estimating the conditional probabilities P(X|Y) by dividing the count of each feature value k within a class by the total count for that class
- The sleep deprivation and symptoms (mild, moderate, severe) example is used to represent categorical Naive Bayes
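A minimal sketch with made-up categorical data (not the course's sleep-deprivation example): scikit-learn's CategoricalNB learns the priors P(Y) and per-feature conditional probabilities P(X|Y) from integer-encoded categories.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy features: [sleep-hours bucket (0/1/2), caffeine (0/1)]; label: symptom severity (0/1/2)
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [2, 0], [2, 1]])
y = np.array([2, 2, 1, 1, 0, 0])

clf = CategoricalNB(alpha=1.0)   # alpha is the additive smoothing discussed below
clf.fit(X, y)
print(clf.predict([[0, 1]]))         # predicted class for a new case
print(clf.predict_proba([[0, 1]]))   # posterior probabilities P(Y|X)
```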
Pros and Cons of Naive Bayes Classification
- Pros: Doesn't require extensive training, relatively fast, works with both categorical and continuous data, and is not sensitive to irrelevant data
- Cons: Assigns zero probability when a categorical feature value never appears in the training data, assumes strong feature independence, and its estimated probabilities can be misleading
When to Use Different Types of Naive Bayes
- Categorical Naive Bayes (CategoricalNB): When features are categorical (e.g., presence/absence of symptoms)
- Gaussian Naive Bayes (GaussianNB): When features are continuous and assumed to follow a normal distribution (e.g., height, weight, temperature)
- Bernoulli Naive Bayes (BernoulliNB): When features are binary (e.g., 0/1 values in text classification)
- Multinomial Naive Bayes (MultinomialNB): When dealing with count-based data, such as word frequencies in documents (e.g., spam detection)
- Complement Naive Bayes (ComplementNB): When class imbalances exist, meaning one class has significantly more examples than another (e.g., rare disease detection)
- Optimal Naive Bayes (OptimalNB): When an optimized form of Naive Bayes is needed, often tuned for specific datasets
Gaussian Naive Bayes (GNB) Assumption
- Instead of categorical probabilities, the feature values follow a normal (Gaussian) distribution
- P(Xj|Y) = N(Xj|µY, σY^2) where Xj is the feature, µY is the mean, and σY^2 is the variance of the feature within class Y
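A minimal sketch with made-up continuous features (heights and weights): GaussianNB estimates a per-class mean and variance for each feature, matching the assumption above. The attribute names theta_ and var_ refer to recent scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0],
              [175.0, 75.0], [155.0, 50.0], [185.0, 90.0]])
y = np.array([1, 1, 0, 1, 0, 1])

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)                    # per-class feature means (µY)
print(gnb.var_)                      # per-class feature variances (σY²)
print(gnb.predict([[165.0, 60.0]]))  # predicted class for a new point
```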
Numerical Issues: Underflow & Zero Probability
- Underflow in Naive Bayes happens when multiplying many small probabilities together, resulting in a value so small it rounds to zero due to floating-point limitations
- The logarithm trick can be used to solve the underflow problem by summing the logarithms instead of computing the product of probabilities: log P(Y|X) = log P(Y) + Σlog P(Xj|Y)
- The spam-ham example uses Multinomial Naive Bayes, because it models word counts
- The zero probability problem occurs when a word never appears in a given class in the training dataset, assigning it zero probability, which eliminates that class from consideration
- In the spam-ham example, if we ignore words that never appear in the training set and only use known words, the model might be overconfident in its predictions and fail when encountering new words
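A minimal sketch of the underflow problem with arbitrary small probabilities (not real word counts): the raw product rounds to zero, while the sum of logs stays a usable score.

```python
import numpy as np

word_probs = np.full(1000, 1e-4)        # 1000 tiny conditional probabilities P(Xj|Y)

product = np.prod(word_probs)           # underflows to 0.0 in 64-bit floats
log_score = np.sum(np.log(word_probs))  # finite and comparable across classes
print(product, log_score)               # 0.0  vs. roughly -9210
```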
Solutions to the Zero Probability Problem
- Additive smoothing (Laplace smoothing) addresses the zero probability problem by adding a small positive number α to all counts
- P(Xj|Y) = (count(Xj,Y) + α) / Σj(count(Xj,Y) + α), which prevents zero probabilities without significantly altering large counts
- The smoothing parameter α must be chosen carefully (commonly α = 1) to balance handling zero probabilities against preserving the real probability distribution, as sketched below
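A minimal sketch of additive smoothing with made-up word counts: the unseen word would get probability zero without smoothing, and a small nonzero probability with α = 1.

```python
import numpy as np

counts = np.array([10, 5, 0, 2])   # counts of four words in one class; the third word is unseen
alpha = 1.0

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))
print(unsmoothed)   # third entry is exactly 0
print(smoothed)     # third entry is small but nonzero
```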
Final Notes: What to Review?
- Review HW1 and examples from the end of Notebook 2
- Ensure understanding of probability assumptions, categorical vs. Gaussian Naive Bayes, and smoothing techniques
K-Nearest Neighbors (KNN)
- A non-parametric, instance-based learning algorithm that classifies a new data point based on the majority class of its K nearest neighbors
- KNN is a non-parametric algorithm, meaning it doesn't assume a specific functional form for the data; instead, it memorizes the training data and makes decisions based on similarity
- Increasing K reduces variance, smooths decision boundaries, and can lead to underfitting
- Decreasing K increases variance, makes model sensitive to noise, and can lead to overfitting
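A minimal sketch on synthetic two-moons data: a very small K fits the training set almost perfectly but generalizes worse, while a very large K smooths the boundary and can underfit.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))  # train vs. test accuracy
```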
Overfitting and Underfitting
- Overfitting happens when a model learns noise in the training data rather than the actual pattern and performs well on the training set but poorly on unseen data
- When overfitting, the model memorizes noise rather than generalizing the underlying data distribution
- Underfitting happens when a model is too simple to capture the pattern in the data, leading to high bias and poor performance on both training and test data.
Test Error and Model Complexity
- Test error is the error rate on unseen data, used to measure a model's generalization ability
- Overfitting showcases low training error and high test error
- Underfitting showcases high training error and high test error
Cross-Validation (CV)
- A technique for evaluating model performance by splitting data into multiple subsets and training/testing on different parts of the dataset.
- Cross-validation helps prevent overfitting, ensures the model generalizes well to unseen data, and helps select hyperparameters.
- Splitting data into training and test sets ensures that model evaluation is done on unseen data, giving a better measure of generalization.
- Leave-one-out cross-validation (LOO-CV) is a special case of k-fold cross-validation where each data point is used as a test set exactly once; it is useful for small datasets but computationally expensive (see the sketch below).
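A minimal sketch, using the iris dataset as an assumed example: 5-fold cross-validation and leave-one-out cross-validation both score a KNN classifier on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

scores_5fold = cross_val_score(model, X, y, cv=5)            # five train/test splits
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())  # one split per data point
print(scores_5fold.mean(), scores_loo.mean())
```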
Regularization
- Prevents overfitting by adding a penalty to large weights in the model, controlling complexity
- L1 (Lasso) Regularization encourages sparse features by setting some coefficients to zero (feature selection) and is suitable when many features are irrelevant
- L2 (Ridge) Regularization shrinks weights smoothly (no zero coefficients) and is suitable when all features contribute but need smaller magnitudes
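A minimal sketch on synthetic regression data with mostly irrelevant features; the alpha values are arbitrary. Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically several
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
```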
Bias-Variance Tradeoff
- Bias: Error due to simplified assumptions (e.g., underfitting)
- Variance: Error due to sensitivity to noise (e.g., overfitting)
- Increasing model complexity lowers bias but increases variance where the goal is to find an optimal balance
Gradient Descent
- The gradient is the rate of change of a function with respect to its parameters
- Mathematically, the gradient of a function f(x) is ∇f(x) = (∂f/∂x1, ..., ∂f/∂xn)
- Gradients on a contour plot show the steepest ascent direction
- If the gradient points towards steepest ascent, the function increases, which is the opposite of what is wanted in minimization problems
Gradient Descent Algorithm
- Initialize weights randomly, compute the gradient of the loss function, update parameters in the opposite direction of the gradient, and repeat until convergence
- Equation: θ(t+1) = θ(t) – α∇J(θ) where α = learning rate and ∇J(θ) = gradient of the loss function
- The learning rate controls how much weights are updated at each step
- Too small a learning rate leads to slow convergence, while too large a rate can overshoot the minimum and never converge
- Gradient descent is valid when the loss function is differentiable, and gradient updates move the function towards a minimum
- The learning rate can be chosen by trial and error, or by using cross-validation (CV) or adaptive methods (e.g., Adam, RMSProp)
- The step size determines how far each update moves along the gradient direction; too large a step may overshoot the optimum, while too small a step converges slowly (see the sketches below)
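A minimal sketch of the update rule on a toy loss J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point and learning rate are arbitrary.

```python
def gradient_descent(alpha=0.1, steps=50):
    theta = 0.0                       # arbitrary initialization
    for _ in range(steps):
        grad = 2 * (theta - 3)        # gradient of J(theta) = (theta - 3)^2
        theta = theta - alpha * grad  # step opposite to the gradient
    return theta

print(gradient_descent())  # converges near the minimizer theta = 3
```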
Key Points on Gradient of the Loss & Step Size for Optimization
- The gradient of the loss function tells how much and in which direction to adjust model parameters to minimize the error
- Mathematically, if the loss function is J(θ), the gradient is: ∇J(θ) = ∂J/∂θ
- In gradient descent, we update parameters using: θ(t+1) = θ(t) – α∇J(θ)
- Step size (learning rate α) controls how much the parameters are adjusted in the direction of the gradient
- Small step size → Slow convergence but stable learning
- Large step size → Faster learning but may overshoot and not converge
- Adaptive step sizes (e.g., Adam optimizer) adjust automatically
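Reusing the same toy quadratic loss, a minimal sketch of how the step size changes behavior: a tiny α crawls toward the minimum, a moderate α converges, and an overly large α diverges.

```python
def run(alpha, steps=30):
    theta = 0.0
    for _ in range(steps):
        theta -= alpha * 2 * (theta - 3)   # gradient of (theta - 3)^2
    return theta

for alpha in (0.01, 0.3, 1.1):
    print(alpha, run(alpha))   # 0.01: still far from 3; 0.3: ~3; 1.1: diverges
```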
Supervised vs. Unsupervised Learning
- Supervised learning is when a model learns from labeled data, meaning each input has a corresponding known output (label), as in classification (spam detection, image recognition) and regression (predicting housing prices)
- Unsupervised learning is when a model learns patterns from unlabeled data (i.e., no explicit output labels), as in clustering (grouping customers based on purchase behavior) and dimensionality reduction (PCA for feature extraction)
- Supervised learning uses labeled data so the model can make predictions
- Unsupervised learning uses unlabeled data and finds structure
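A minimal sketch contrasting the two settings on the iris data (an assumed example): the supervised model is fit on (X, y) pairs, while the clustering model only ever sees X.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

supervised = LogisticRegression(max_iter=1000).fit(X, y)  # learns from labels y
unsupervised = KMeans(n_clusters=3, n_init=10).fit(X)     # never sees y

print(supervised.predict(X[:5]))   # predicted class labels
print(unsupervised.labels_[:5])    # discovered cluster assignments
```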