Questions and Answers
Which of the following actions is most crucial in ensuring data integrity when transferring files between systems?
- Encrypting the file to protect against unauthorized access.
- Implementing checksums or hash values to verify data. (correct)
- Using the fastest available network connection.
- Compressing the file to reduce transfer time.
In the context of data management, what is the primary benefit of using version control systems?
- Automating data backups to prevent data loss.
- Encrypting sensitive data to protect it from unauthorized access.
- Compressing data to save storage space.
- Tracking and managing changes to data over time. (correct)
What is the most important consideration when selecting a data storage solution for long-term archival purposes?
- The durability and longevity of the storage medium. (correct)
- The initial cost of the storage solution.
- The scalability of the storage solution.
- The speed of data retrieval.
Which of the following techniques is most effective for preventing SQL injection attacks?
What is the benefit of using data validation techniques?
What is the primary purpose of data normalization in database design?
Which of the following methods is most suitable for securely disposing of a hard drive containing sensitive data?
What role does metadata play in data governance?
Flashcards
What is an Array?
A data structure where elements are arranged sequentially, each identified by an index.
What is Sorting?
The process of arranging elements in a specific order, often numerically or alphabetically.
What is Searching?
A method for locating a specific element within a data structure.
What is a Graph?
A data structure consisting of nodes (vertices) connected by edges, used to represent relationships between entities.
What is a Chart?
A visual representation of data, such as a bar, line, or pie chart.
What is Information?
Data that has been processed, organized, or given context so that it becomes meaningful.
What is a Database?
An organized collection of structured data that can be stored, queried, and managed electronically.
What is a Loop?
A control structure that repeats a block of instructions until a specified condition is met.
Study Notes
Loss Function
- Measures the error between predicted and actual values in a model
- The goal is to minimize the error by optimizing model parameters
Likelihood and Probability
- Likelihood measures how likely a specific set of parameters explains the observed data
- Probability measures the likelihood of observing a certain outcome given fixed parameters
- Probability deals with data given fixed parameters
- Likelihood measures the "fit" of parameters given the observed data
- Equation for likelihood: L(θ|X) = Πᵢ P(xᵢ|θ), the product over i = 1, ..., n, for data points X = {x₁, x₂, ..., xₙ} and parameter θ
Maximum Likelihood Estimation (MLE)
- Purpose is to estimate the parameter θ that maximizes the likelihood of observing the given data
- θMLE = arg max L(θ|X) over θ
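A minimal sketch of MLE (the coin-flip data below is simulated, not from the course): evaluating the likelihood on a grid of candidate θ values and taking the arg max lands near the closed-form answer, the sample mean.

```python
import numpy as np

# Simulated coin flips (assumed example data); true parameter is 0.7
rng = np.random.default_rng(0)
X = rng.binomial(n=1, p=0.7, size=100)

# Evaluate L(theta|X) = product of P(x_i|theta) on a grid of candidate thetas
thetas = np.linspace(0.01, 0.99, 99)
likelihood = np.array([np.prod(np.where(X == 1, t, 1 - t)) for t in thetas])

theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle, X.mean())  # grid arg max is close to the sample mean (closed-form MLE)
```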
Negative Log-Likelihood (NLL)
- Transforms the product of probabilities (likelihood) into a sum for easier optimization.
- Minimize NLL instead of maximizing the likelihood
- Commonly used as a loss function in classification models, especially in probabilistic models
- Minimizing the NLL is equivalent to maximizing the likelihood as min(-log L(θ|X)) is equivalent to max L(θ|X)
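A minimal sketch of the same idea, assuming the simulated coin-flip data from the MLE sketch above: summing log-probabilities gives the same optimum as multiplying probabilities, without risking underflow.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(n=1, p=0.7, size=100)   # same simulated coin flips as above
thetas = np.linspace(0.01, 0.99, 99)

# NLL(theta) = -sum_i log P(x_i|theta): a sum of logs instead of a product of probabilities
nll = np.array([-np.sum(np.log(np.where(X == 1, t, 1 - t))) for t in thetas])

# Minimizing the NLL picks the same theta as maximizing the likelihood
print(thetas[np.argmin(nll)])
```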
Maximum A Posteriori (MAP)
- Purpose is to estimate the parameter θ that maximizes the posterior probability by incorporating prior knowledge
- Bayes' Rule defines the posterior: P(θ|X) = P(X|θ)P(θ) / P(X)
- Posterior = likelihood × prior / evidence
- The posterior is used when incorporating prior knowledge about a parameter
- Solve for θMAP = arg max P(θ|X) over θ to find the parameter using the posterior
- A prior is useful with a small sample size, when there is real background knowledge not included in the data, or when the prior can serve as a regularizer (e.g., Lasso and Ridge regularization)
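A minimal sketch of MAP versus MLE for a Bernoulli parameter, using an assumed Beta(5, 5) prior as pseudo-counts: with only five observations the prior pulls the estimate toward 0.5, acting like a regularizer.

```python
import numpy as np

X = np.array([1, 1, 1, 1, 0])   # tiny made-up sample
heads, n = X.sum(), len(X)
a, b = 5, 5                     # Beta prior pseudo-counts (assumed for illustration)

theta_mle = heads / n                          # 0.8, driven only by the data
theta_map = (heads + a - 1) / (n + a + b - 2)  # ~0.62, pulled toward the prior mean 0.5
print(theta_mle, theta_map)
```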
Machine Learning Workflow
- Consists of having data, assuming the model has parameters, and optimizing a loss function to fit the model to the data.
Loss Functions
- Negative Log-Likelihood (NLL) is used in probabilistic models or classification tasks and minimizes the difference between predicted and actual probability distributions
- Sum or Mean Absolute Error (MAE) penalizes deviations between actual and predicted values linearly rather than quadratically, making it more robust to outliers than Mean Squared Error (MSE)
- Lasso (L1 Loss/Regularization) is used to promote sparsity in a model to force some coefficients to zero.
- Lasso is ideal for feature selection
- Ridge (L2 Loss/Regularization) penalizes large model coefficients without forcing them to zero and is used to prevent overfitting in regression models
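A minimal sketch with made-up numbers showing why MAE reacts less to an outlier than MSE, and how the NLL (log loss) scores a classifier's predicted probabilities.

```python
import numpy as np

# Regression losses: the last prediction is a deliberate outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 10.0])
mse = np.mean((y_true - y_pred) ** 2)    # inflated quadratically by the outlier
mae = np.mean(np.abs(y_true - y_pred))   # grows only linearly with the outlier
print(mse, mae)

# Classification loss: NLL of predicted probabilities for binary labels
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])
nll = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(nll)
```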
Other Things to Study
- Review HW1 and examples from the end of Notebook 2 for reinforcement
Naive Bayes: Parameters, Features, and Labels
- Parameters are the underlying values the model learns (e.g., mean and variance in Gaussian Naive Bayes)
- Features are the input variables used for prediction
- Discrete Labels are the possible categories the model predicts (e.g., spam or not spam)
Starting Point of Naive Bayes Classifiers
- Naive Bayes classifiers begin with Bayes' Rule: P(Y|X) = P(X|Y)P(Y) / P(X)
- It conditions on the predictor, meaning it calculates the probability of the class given the observed features
Categorical Naive Bayes
- The main assumption is features are conditionally independent given the class label
- This assumption simplifies learning and reduces computational complexity by allowing computation of probabilities independently
- Steps include calculating the base rates (priors) P(Y), then estimating the conditional probabilities P(X|Y) by dividing the count of each feature value k within a class by the total count for that class
- The sleep deprivation and symptoms (mild, moderate, severe) example is used to represent categorical Naive Bayes
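A minimal sketch with made-up categorical data (not the course's sleep-deprivation example): scikit-learn's CategoricalNB learns the priors P(Y) and per-feature conditional probabilities P(X|Y) from integer-encoded categories.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy features: [sleep-hours bucket (0/1/2), caffeine (0/1)]; label: symptom severity (0/1/2)
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [2, 0], [2, 1]])
y = np.array([2, 2, 1, 1, 0, 0])

clf = CategoricalNB(alpha=1.0)   # alpha is the additive smoothing discussed below
clf.fit(X, y)
print(clf.predict([[0, 1]]))         # predicted class for a new case
print(clf.predict_proba([[0, 1]]))   # posterior probabilities P(Y|X)
```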
Pros and Cons of Naive Bayes Classification
- Pros: Doesn't require extensive training, relatively fast, works with both categorical and continuous data, and is not sensitive to irrelevant data
- Cons: Assigns zero probability when a categorical feature value never appears in the training data, assumes strong feature independence, and its estimated probabilities can be misleading
When to Use Different Types of Naive Bayes
- Categorical Naive Bayes (CategoricalNB): When features are categorical (e.g., presence/absence of symptoms)
- Gaussian Naive Bayes (GaussianNB): When features are continuous and assumed to follow a normal distribution (e.g., height, weight, temperature)
- Bernoulli Naive Bayes (BernoulliNB): When features are binary (e.g., 0/1 values in text classification)
- Multinomial Naive Bayes (MultinomialNB): When dealing with count-based data, such as word frequencies in documents (e.g., spam detection)
- Complement Naive Bayes (ComplementNB): When class imbalances exist, meaning one class has significantly more examples than another (e.g., rare disease detection)
- Optimal Naive Bayes (OptimalNB): When an optimized form of Naive Bayes is needed, often tuned for specific datasets
Gaussian Naive Bayes (GNB) Assumption
- Instead of categorical probabilities, the feature values follow a normal (Gaussian) distribution
- P(Xj|Y) = N(Xj|µY, σY^2) where Xj is the feature, µY is the mean, and σY^2 is the variance of the feature within class Y
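A minimal sketch with made-up continuous features (heights and weights): GaussianNB estimates a per-class mean and variance for each feature, matching the assumption above. The attribute names theta_ and var_ refer to recent scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0],
              [175.0, 75.0], [155.0, 50.0], [185.0, 90.0]])
y = np.array([1, 1, 0, 1, 0, 1])

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)                    # per-class feature means (µY)
print(gnb.var_)                      # per-class feature variances (σY²)
print(gnb.predict([[165.0, 60.0]]))  # predicted class for a new point
```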
Numerical Issues: Underflow & Zero Probability
- Underflow in Naive Bayes happens when multiplying many small probabilities together, resulting in a value so small it rounds to zero due to floating-point limitations
- The logarithm trick can be used to solve the underflow problem by summing the logarithms instead of computing the product of probabilities: log P(Y|X) = log P(Y) + Σlog P(Xj|Y)
- The spam-ham example uses Multinomial Naive Bayes, because it models word counts
- The zero probability problem occurs when a word never appears in a given class in the training dataset, assigning it zero probability, which eliminates that class from consideration
- In the spam-ham example, if we ignore words that never appear in the training set and only use known words, the model might be overconfident in its predictions and fail when encountering new words
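A minimal sketch of the underflow problem with arbitrary small probabilities (not real word counts): the raw product rounds to zero, while the sum of logs stays a usable score.

```python
import numpy as np

word_probs = np.full(1000, 1e-4)        # 1000 tiny conditional probabilities P(Xj|Y)

product = np.prod(word_probs)           # underflows to 0.0 in 64-bit floats
log_score = np.sum(np.log(word_probs))  # finite and comparable across classes
print(product, log_score)               # 0.0  vs. roughly -9210
```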
Solutions to the Zero Probability Problem
- Additive smoothing (Laplace smoothing) addresses the zero probability problem by adding a small positive number α to all counts
- P(Xj|Y) = (count(Xj,Y) + α) / Σj(count(Xj,Y) + α), which prevents zero probabilities without significantly altering large counts
- The smoothing parameter α must be chosen carefully (commonly α = 1) to balance handling zero probabilities against preserving the real probability distribution, as sketched below
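A minimal sketch of additive smoothing with made-up word counts: the unseen word would get probability zero without smoothing, and a small nonzero probability with α = 1.

```python
import numpy as np

counts = np.array([10, 5, 0, 2])   # counts of four words in one class; the third word is unseen
alpha = 1.0

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))
print(unsmoothed)   # third entry is exactly 0
print(smoothed)     # third entry is small but nonzero
```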
Final Notes: What to Review?
- Review HW1 and examples from the end of Notebook 2
- Ensure understanding of probability assumptions, categorical vs. Gaussian Naive Bayes, and smoothing techniques
K-Nearest Neighbors (KNN)
- A non-parametric, instance-based learning algorithm that classifies a new data point based on the majority class of its K nearest neighbors
- KNN is a non-parametric algorithm, meaning it doesn't assume a specific functional form for the data; instead, it memorizes the training data and makes decisions based on similarity
- Increasing K reduces variance, smooths decision boundaries, and can lead to underfitting
- Decreasing K increases variance, makes model sensitive to noise, and can lead to overfitting
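A minimal sketch on synthetic two-moons data: a very small K fits the training set almost perfectly but generalizes worse, while a very large K smooths the boundary and can underfit.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))  # train vs. test accuracy
```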
Overfitting and Underfitting
- Overfitting happens when a model learns noise in the training data rather than the actual pattern and performs well on the training set but poorly on unseen data
- When overfitting, the model memorizes noise rather than generalizing the underlying data distribution
- Underfitting happens when a model is too simple to capture the pattern in the data, leading to high bias and poor performance on both training and test data.
Test Error and Model Complexity
- Test error is the error rate on unseen data, used to measure a model's generalization ability
- Overfitting showcases low training error and high test error
- Underfitting showcases high training error and high test error
Cross-Validation (CV)
- A technique for evaluating model performance by splitting data into multiple subsets and training/testing on different parts of the dataset.
- Cross-validation helps prevent overfitting, ensures the model generalizes well to unseen data, and helps select hyperparameters.
- Splitting data into training and test sets ensures that model evaluation is done on unseen data, giving a better measure of generalization.
- Leave-one-out cross-validation (LOO-CV) is a special case of k-fold cross-validation where each data point is used as a test set exactly once; it is useful for small datasets but computationally expensive (see the sketch below).
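A minimal sketch, using the iris dataset as an assumed example: 5-fold cross-validation and leave-one-out cross-validation both score a KNN classifier on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

scores_5fold = cross_val_score(model, X, y, cv=5)            # five train/test splits
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())  # one split per data point
print(scores_5fold.mean(), scores_loo.mean())
```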
Regularization
- Prevents overfitting by adding a penalty to large weights in the model, controlling complexity
- L1 (Lasso) Regularization encourages sparse features by setting some coefficients to zero (feature selection) and is suitable when many features are irrelevant
- L2 (Ridge) Regularization shrinks weights smoothly (no zero coefficients) and is suitable when all features contribute but need smaller magnitudes
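A minimal sketch on synthetic regression data with mostly irrelevant features; the alpha values are arbitrary. Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically several
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
```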
Bias-Variance Tradeoff
- Bias: Error due to simplified assumptions (e.g., underfitting)
- Variance: Error due to sensitivity to noise (e.g., overfitting)
- Increasing model complexity lowers bias but increases variance where the goal is to find an optimal balance
Gradient Descent
- The gradient is the rate of change of a function with respect to its parameters
- Mathematically, the gradient of a function f(x) is ∇f(x) = (∂f/∂x1, ..., ∂f/∂xn)
- Gradients on a contour plot show the steepest ascent direction
- If the gradient points towards steepest ascent, the function increases, which is the opposite of what is wanted in minimization problems
Gradient Descent Algorithm
- Initialize weights randomly, compute the gradient of the loss function, update parameters in the opposite direction of the gradient, and repeat until convergence
- Equation: θ(t+1) = θ(t) – α∇J(θ) where α = learning rate and ∇J(θ) = gradient of the loss function
- The learning rate controls how much weights are updated at each step
- Too small a learning rate leads to slow convergence, while too large a rate can overshoot the minimum and never converge
- Gradient descent is valid when the loss function is differentiable, and gradient updates move the function towards a minimum
- The learning rate can be chosen by trial and error, or by using cross-validation (CV) or adaptive methods (e.g., Adam, RMSProp)
- The step size determines how far each update moves along the gradient direction; too large a step may overshoot the optimum, while too small a step converges slowly (see the sketches below)
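A minimal sketch of the update rule on a toy loss J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point and learning rate are arbitrary.

```python
def gradient_descent(alpha=0.1, steps=50):
    theta = 0.0                       # arbitrary initialization
    for _ in range(steps):
        grad = 2 * (theta - 3)        # gradient of J(theta) = (theta - 3)^2
        theta = theta - alpha * grad  # step opposite to the gradient
    return theta

print(gradient_descent())  # converges near the minimizer theta = 3
```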
Key Points on Gradient of the Loss & Step Size for Optimization
- The gradient of the loss function tells how much and in which direction to adjust model parameters to minimize the error
- Mathematically, if the loss function is J(θ), the gradient is: ∇J(θ) = ∂J/∂θ
- In gradient descent, we update parameters using: θ(t+1) = θ(t) – α∇J(θ)
- Step size (learning rate α) controls how much the parameters are adjusted in the direction of the gradient
- Small step size → Slow convergence but stable learning
- Large step size → Faster learning but may overshoot and not converge
- Adaptive step sizes (e.g., Adam optimizer) adjust automatically
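Reusing the same toy quadratic loss, a minimal sketch of how the step size changes behavior: a tiny α crawls toward the minimum, a moderate α converges, and an overly large α diverges.

```python
def run(alpha, steps=30):
    theta = 0.0
    for _ in range(steps):
        theta -= alpha * 2 * (theta - 3)   # gradient of (theta - 3)^2
    return theta

for alpha in (0.01, 0.3, 1.1):
    print(alpha, run(alpha))   # 0.01: still far from 3; 0.3: ~3; 1.1: diverges
```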
Supervised vs. Unsupervised Learning
- Supervised learning is when a model learns from labeled data, meaning each input has a corresponding known output (label), as in classification (spam detection, image recognition) and regression (predicting housing prices)
- Unsupervised learning is when a model learns patterns from unlabeled data (i.e., no explicit output labels), as in clustering (grouping customers based on purchase behavior) and dimensionality reduction (PCA for feature extraction)
- Supervised learning uses labeled data so the model can make predictions
- Unsupervised learning uses unlabeled data and finds structure
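A minimal sketch contrasting the two settings on the iris data (an assumed example): the supervised model is fit on (X, y) pairs, while the clustering model only ever sees X.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

supervised = LogisticRegression(max_iter=1000).fit(X, y)  # learns from labels y
unsupervised = KMeans(n_clusters=3, n_init=10).fit(X)     # never sees y

print(supervised.predict(X[:5]))   # predicted class labels
print(unsupervised.labels_[:5])    # discovered cluster assignments
```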