Questions and Answers

Which of the following actions is most crucial in ensuring data integrity when transferring files between systems?

  • Encrypting the file to protect against unauthorized access.
  • Implementing checksums or hash values to verify data. (correct)
  • Using the fastest available network connection.
  • Compressing the file to reduce transfer time.

In the context of data management, what is the primary benefit of using version control systems?

  • Automating data backups to prevent data loss.
  • Encrypting sensitive data to protect it from unauthorized access.
  • Compressing data to save storage space.
  • Tracking and managing changes to data over time. (correct)

What is the most important consideration when selecting a data storage solution for long-term archival purposes?

  • The durability and longevity of the storage medium. (correct)
  • The initial cost of the storage solution.
  • The scalability of the storage solution.
  • The speed of data retrieval.

Which of the following techniques is most effective for preventing SQL injection attacks?

  • Using parameterized queries or prepared statements. (correct)

What is the benefit of using data validation techniques?

  • To ensure data accuracy and consistency. (correct)

What is the primary purpose of data normalization in database design?

  • To reduce data redundancy and improve data integrity. (correct)

Which of the following methods is most suitable for securely disposing of a hard drive containing sensitive data?

  • Physically destroying the drive. (correct)

What role does metadata play in data governance?

  • It serves as documentation, enabling better understanding, control, and management of data assets. (correct)

Flashcards

What is an Array?

A data structure where elements are arranged sequentially, each identified by an index.

What is Sorting?

The process of arranging elements in a specific order, often numerically or alphabetically.

What is Searching?

A method for locating a specific element within a data structure.

What is a Graph?

A data structure where connections between nodes are represented, forming a network.

What is a Chart?

A visual representation of data, often used to easily identify patterns and trends.

What is Information?

Data that has been organized and structured so that it conveys meaning and context.

What is a Database?

A system for organizing and managing collections of data, often stored electronically.

What is a Loop?

A programming construct that allows a block of code to be executed repeatedly.

Study Notes

Loss Function

  • Measures the error between predicted and actual values in a model
  • The goal is to minimize the error by optimizing model parameters
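
A minimal sketch of one such loss, mean squared error, on made-up predictions (NumPy assumed):

```python
import numpy as np

# Hypothetical ground-truth values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Mean squared error: the average squared difference between prediction and truth
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375 -- the quantity the optimizer tries to drive down
```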

Likelihood and Probability

  • Likelihood measures how likely a specific set of parameters explains the observed data
  • Probability measures how likely a certain outcome is to be observed, given fixed parameters
  • Probability deals with data given fixed parameters
  • Likelihood measures the "fit" of parameters given the observed data
  • Equation for likelihood: L(θ|X) = ∏ P(xi|θ), the product taken over i = 1, ..., n, for data points X = {x1, x2, ..., xn} and parameter θ

Maximum Likelihood Estimation (MLE)

  • Purpose is to estimate the parameter θ that maximizes the likelihood of observing the given data
  • θMLE = arg max L(θ|X) over θ
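
A minimal MLE sketch on invented coin-flip data: for a Bernoulli parameter θ, a grid search over the likelihood recovers the closed-form answer, the sample mean.

```python
import numpy as np

# Hypothetical Bernoulli observations (1 = success, 0 = failure)
X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Likelihood L(theta | X) = product of P(x_i | theta)
def likelihood(theta):
    return np.prod(theta ** X * (1 - theta) ** (1 - X))

# Grid search for the theta that maximizes the likelihood
thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([likelihood(t) for t in thetas])]

print(theta_mle)  # ~0.7
print(X.mean())   # closed-form Bernoulli MLE: the sample mean, also 0.7
```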

Negative Log-Likelihood (NLL)

  • Transforms the product of probabilities (likelihood) into a sum for easier optimization.
  • Minimize NLL instead of maximizing the likelihood
  • Commonly used as a loss function in classification models, especially in probabilistic models
  • Minimizing the NLL is equivalent to maximizing the likelihood as min(-log L(θ|X)) is equivalent to max L(θ|X)
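
Continuing the toy example above, the same estimate falls out of minimizing the NLL, now a sum of logs rather than a product:

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # same hypothetical coin-flip data

# Negative log-likelihood: -sum of log P(x_i | theta)
def nll(theta):
    return -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
theta_hat = thetas[np.argmin([nll(t) for t in thetas])]
print(theta_hat)  # ~0.7 -- identical to the estimate from maximizing the likelihood
```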

Maximum A Posteriori (MAP)

  • Purpose is to estimate the parameter θ that maximizes the posterior probability by incorporating prior knowledge
  • Bayes' Rule defines the posterior: P(θ|X) = P(X|θ)P(θ) / P(X)
  • Posterior = likelihood × prior / evidence
  • The posterior is used when incorporating prior knowledge about a parameter
  • Solve for θMAP = arg max P(θ|X) over θ to find the parameter using the posterior
  • A prior is useful with a small sample size, when there is real background knowledge not included in the data, or when the prior can serve as a regularizer (e.g., Lasso and Ridge regularization)
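
A minimal MAP sketch for the same Bernoulli setup under an assumed Beta(a, b) prior (all numbers invented); the posterior mode has the closed form (successes + a − 1) / (n + a + b − 2):

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical data
a, b = 2.0, 2.0                                # assumed Beta(2, 2) prior, acting as a regularizer

# MAP estimate: mode of the Beta posterior over the Bernoulli parameter
theta_map = (X.sum() + a - 1) / (len(X) + a + b - 2)

print(theta_map)  # ~0.667 -- pulled toward the prior mean of 0.5
print(X.mean())   #  0.7   -- the MLE, which ignores the prior
```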

Machine Learning Workflow

  • Consists of having data, assuming the model has parameters, and optimizing a loss function to fit the model to the data.

Loss Functions

  • Negative Log-Likelihood (NLL) is used in probabilistic models or classification tasks and minimizes the difference between predicted and actual probability distributions
  • Sum or Mean Absolute Error (MAE) is appropriate when large deviations between actual and predicted values should not dominate the loss, and it is more robust to outliers than Mean Squared Error (MSE)
  • Lasso (L1 Loss/Regularization) is used to promote sparsity in a model to force some coefficients to zero.
  • Lasso is ideal for feature selection
  • Ridge (L2 Loss/Regularization) penalizes large model coefficients without forcing them to zero and is used to prevent overfitting in regression models
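
A small comparison on invented values showing why MAE is less dominated by outliers than MSE:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 9.0])   # the last prediction is an outlier

mae = np.mean(np.abs(y_true - y_pred))    # grows linearly with the outlier
mse = np.mean((y_true - y_pred) ** 2)     # the squared term lets the outlier dominate

print(mae, mse)  # 1.35 vs ~6.27
```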

Other Things to Study

  • Review HW1 and examples from the end of Notebook 2 for reinforcement

Naive Bayes: Parameters, Features, and Labels

  • Parameters are the underlying values the model learns (e.g., mean and variance in Gaussian Naive Bayes)
  • Features are the input variables used for prediction
  • Discrete Labels are the possible categories the model predicts (e.g., spam or not spam)

Starting Point of Naive Bayes Classifiers

  • Naive Bayes classifiers begin with Bayes' Rule: P(Y|X) = P(X|Y)P(Y) / P(X)
  • It conditions on the predictor, meaning it calculates the probability of the class given the observed features

Categorical Naive Bayes

  • The main assumption is features are conditionally independent given the class label
  • This assumption simplifies learning and reduces computational complexity by allowing computation of probabilities independently
  • Steps include calculating the base rates (priors) P(Y), then estimating each conditional probability P(X|Y) by dividing the count of each feature value within a class by that class's total count (see the sketch after this list)
  • The sleep deprivation and symptom-severity (mild, moderate, severe) example illustrates categorical Naive Bayes
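
A minimal sketch of those counting steps using scikit-learn's CategoricalNB on invented, integer-encoded features:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical categorical features, integer-encoded
# (e.g., column 0: sleep-deprivation level, column 1: symptom severity)
X = np.array([[0, 1], [1, 2], [0, 0], [1, 1], [0, 2], [1, 0]])
y = np.array([0, 1, 0, 1, 0, 1])  # hypothetical class labels

model = CategoricalNB()            # estimates P(Y) and P(X_k | Y) from the counts
model.fit(X, y)

print(model.predict([[0, 1]]))        # predicted class for a new observation
print(model.predict_proba([[0, 1]]))  # the per-class posterior probabilities
```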

Pros and Cons of Naive Bayes Classification

  • Pros: Doesn't require extensive training, relatively fast, works with both categorical and continuous data, and is not sensitive to irrelevant data
  • Cons: Assigns zero probability if a categorical variable is missing in training, assumes strong independence, and probabilities can be misleading

When to Use Different Types of Naive Bayes

  • Categorical Naive Bayes (CategoricalNB): When features are categorical (e.g., presence/absence of symptoms)
  • Gaussian Naive Bayes (GaussianNB): When features are continuous and assumed to follow a normal distribution (e.g., height, weight, temperature)
  • Bernoulli Naive Bayes (BernoulliNB): When features are binary (e.g., 0/1 values in text classification)
  • Multinomial Naive Bayes (MultinomialNB): When dealing with count-based data, such as word frequencies in documents (e.g., spam detection)
  • Complement Naive Bayes (ComplementNB): When class imbalances exist, meaning one class has significantly more examples than another (e.g., rare disease detection)
  • Optimal Naive Bayes (OptimalNB): When an optimized form of Naive Bayes is needed, often tuned for specific datasets
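
Several of these variants ship with scikit-learn and share the same fit/predict interface; a minimal sketch on invented word counts (GaussianNB and CategoricalNB follow the same pattern with continuous or integer-encoded features):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, ComplementNB

# Hypothetical word-count matrix (rows = documents, columns = vocabulary terms)
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 0, 2]])
y = np.array([0, 1, 0, 1])  # e.g., 0 = ham, 1 = spam

for Model in (BernoulliNB, MultinomialNB, ComplementNB):
    clf = Model().fit(X_counts, y)
    print(Model.__name__, clf.predict([[1, 0, 2]]))
```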

Gaussian Naive Bayes (GNB) Assumption

  • Instead of categorical probabilities, the feature values follow a normal (Gaussian) distribution
  • P(Xj|Y) = N(Xj|µY, σY^2) where Xj is the feature, µY is the mean, and σY^2 is the variance of the feature within class Y
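
A sketch of that assumption with GaussianNB on made-up continuous features; the fitted per-class means and variances correspond to the µY and σY^2 above (attribute names as in recent scikit-learn versions):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features (e.g., height in cm, weight in kg)
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 75.0]])
y = np.array([0, 1, 0, 1])

gnb = GaussianNB().fit(X, y)   # fits one mean and one variance per feature per class

print(gnb.theta_)              # per-class feature means (mu_Y)
print(gnb.var_)                # per-class feature variances (sigma_Y^2)
print(gnb.predict([[172.0, 70.0]]))
```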

Numerical Issues: Underflow & Zero Probability

  • Underflow in Naive Bayes happens when multiplying many small probabilities together, resulting in a value so small it rounds to zero due to floating-point limitations
  • The logarithm trick can be used to solve the underflow problem by summing the logarithms instead of computing the product of probabilities: log P(Y|X) = log P(Y) + Σlog P(Xj|Y)
  • The spam-ham example uses Multinomial Naive Bayes, because it models word counts
  • The zero probability problem occurs when a word never appears in a given class in the training dataset, assigning it zero probability, which eliminates that class from consideration
  • In the spam-ham example, if we ignore words that do not appear in the training set and only use known words, the model may become overconfident in its predictions and fail when it encounters new words
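
A quick numerical illustration of underflow and the logarithm trick (the per-feature probabilities here are invented):

```python
import numpy as np

# 1,000 small per-feature probabilities, as might arise for a long document
probs = np.full(1000, 1e-4)

print(np.prod(probs))         # 0.0 -- the product underflows double precision
print(np.sum(np.log(probs)))  # about -9210.3 -- the log-space sum stays representable
```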

Solutions to the Zero Probability Problem

  • Additive Smoothing (Laplace Smoothing) addresses the zero probability problem by adding a small positive number α to all counts
  • P(Xj|Y) = (count(Xj,Y) + α) / (Σj(count(Xj,Y) + α)) prevents zero probabilities without significantly altering large counts
  • The smoothing parameter α must be chosen carefully (commonly α = 1) to balance handling zero probabilities against keeping the real probability distribution intact
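
A minimal sketch of additive smoothing on invented counts, showing how α = 1 removes the zero without much distorting the larger counts:

```python
import numpy as np

counts = np.array([10, 5, 0, 3])   # hypothetical word counts within one class; one word never appears
alpha = 1.0

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(unsmoothed)  # approx [0.556 0.278 0.    0.167] -- zero probability for the unseen word
print(smoothed)    # approx [0.5   0.273 0.045 0.182] -- no zeros, large counts barely change
```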

Final Notes: What to Review?

  • Review HW1 and examples from the end of Notebook 2
  • Ensure understanding of probability assumptions, categorical vs. Gaussian Naive Bayes, and smoothing techniques

K-Nearest Neighbors (KNN)

  • A non-parametric, instance-based learning algorithm that classifies a new data point based on the majority class of its K nearest neighbors
  • KNN is a non-parametric algorithm, meaning it doesn't assume a specific functional form for the data, instead, it memorizes the training data and makes decisions based on similarity
  • Increasing K reduces variance, smooths decision boundaries, and can lead to underfitting
  • Decreasing K increases variance, makes model sensitive to noise, and can lead to overfitting
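
A sketch of that K tradeoff with scikit-learn's KNeighborsClassifier on synthetic data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data; the added noise makes a small K prone to overfitting
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
# Small K: near-perfect training accuracy but weaker test accuracy (high variance).
# Very large K: a much smoother boundary, with a risk of underfitting (high bias).
```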

Overfitting and Underfitting

  • Overfitting happens when a model learns noise in the training data rather than the actual pattern and performs well on the training set but poorly on unseen data
  • When overfitting, the model memorizes noise rather than generalizing from the underlying data distribution
  • Underfitting happens when a model is too simple to capture the pattern in the data, leading to high bias and poor performance on both training and test data.

Test Error and Model Complexity

  • Test error is the error rate on unseen data, used to measure a model's generalization ability
  • Overfitting showcases low training error and high test error
  • Underfitting showcases high training error and high test error

Cross-Validation (CV)

  • A technique for evaluating model performance by splitting data into multiple subsets and training/testing on different parts of the dataset.
  • Cross-validation helps prevent overfitting, ensures the model generalizes well to unseen data, and helps select hyperparameters.
  • Splitting data into training and test sets ensures that model evaluation is done on unseen data, giving a better measure of generalization.
  • Leave-one-out cross-validation (LOO-CV) is a special case of k-fold cross-validation where each data point is used as the test set exactly once; it is useful for small datasets but computationally expensive.
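
A sketch of k-fold CV and LOO-CV with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)

# 5-fold CV: five fits, each held-out fold scored once
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: 100 fits, one per data point -- thorough but expensive
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```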

Regularization

  • Prevents overfitting by adding a penalty to large weights in the model, controlling complexity
  • L1 (Lasso) Regularization encourages sparse features by setting some coefficients to zero (feature selection) and is suitable when many features are irrelevant
  • L2 (Ridge) Regularization shrinks weights smoothly (no zero coefficients) and is suitable when all features contribute but need smaller magnitudes
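
A sketch contrasting the two penalties on synthetic data where only a few features matter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # L1 typically zeroes out the irrelevant coefficients
print(np.sum(ridge.coef_ == 0))  # L2 only shrinks them, so exact zeros are rare
```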

Bias-Variance Tradeoff

  • Bias: Error due to simplified assumptions (e.g., underfitting)
  • Variance: Error due to sensitivity to noise (e.g., overfitting)
  • Increasing model complexity lowers bias but increases variance where the goal is to find an optimal balance

Gradient Descent

  • The gradient is the rate of change of a function with respect to its parameters
  • Mathematically, the gradient of a function f(x) is ∇f(x) = (∂f/∂x1, ..., ∂f/∂xn)
  • Gradients on a contour plot show the steepest ascent direction
  • If the gradient points towards steepest ascent, the function increases, which is the opposite of what is wanted in minimization problems

Gradient Descent Algorithm

  • Initialize weights randomly, compute the gradient of the loss function, update parameters in the opposite direction of the gradient, and repeat until convergence
  • Equation: θ(t+1) = θ(t) – α∇J(θ) where α = learning rate and ∇J(θ) = gradient of the loss function
  • The learning rate controls how much weights are updated at each step
  • A learning rate that is too small implies slow convergence, while one that is too large can overshoot and never converge
  • Gradient descent is valid when the loss function is differentiable, and gradient updates move the function towards a minimum
  • The learning rate can be chosen by trial and error, by cross-validation (CV), or with adaptive methods (e.g., Adam, RMSProp)
  • The step size determines how far each update moves along the gradient direction; too large a step may skip past the optimum, while too small a step makes convergence slow
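
A minimal gradient-descent sketch on invented one-dimensional regression data, following the update θ(t+1) = θ(t) − α∇J(θ):

```python
import numpy as np

# Hypothetical data generated as y ≈ 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0   # arbitrary initialization
alpha = 0.1   # learning rate

for _ in range(200):
    grad = np.mean(2 * (theta * x - y) * x)  # gradient of the mean squared error J(theta)
    theta = theta - alpha * grad             # step opposite to the gradient

print(theta)  # converges to roughly 3.0
```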

Key Points on Gradient of the Loss & Step Size for Optimization

  • The gradient of the loss function tells how much and in which direction to adjust model parameters to minimize the error
  • Mathematically, if the loss function is J(θ), the gradient is: ∇J(θ) = ∂J/∂θ
  • In gradient descent, we update parameters using: θ(t+1) = θ(t) – α∇J(θ)
  • Step size (learning rate α) controls how much the parameters are adjusted in the direction of the gradient
    • Small step size → Slow convergence but stable learning
    • Large step size → Faster learning but may overshoot and not converge
    • Adaptive step sizes (e.g., Adam optimizer) adjust automatically
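
Continuing the sketch above, the effect of the step size is easy to see on the simple quadratic J(θ) = θ², whose gradient is 2θ:

```python
def run(alpha, steps=20, theta=5.0):
    """Run gradient descent on J(theta) = theta**2 with a fixed learning rate."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(0.01))  # ~3.3  -- too small: still far from the minimum at 0 after 20 steps
print(run(0.5))   #  0.0  -- well chosen: reaches the minimum immediately
print(run(1.1))   # ~190  -- too large: every step overshoots and the iterates diverge
```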

Supervised vs. Unsupervised Learning

  • Supervised learning is when a model learns from labeled data, meaning each input has a corresponding known output (label), as in classification (spam detection, image recognition) and regression (predicting housing prices)
  • Unsupervised learning is when a model learns patterns from unlabeled data (i.e., no explicit output labels), as in clustering (grouping customers by purchase behavior) and dimensionality reduction (PCA for feature extraction)
  • Supervised learning uses labeled data so the model can make predictions
  • Unsupervised learning uses unlabeled data and finds structure
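
A sketch contrasting the two settings with scikit-learn on synthetic data: the classifier sees the labels, while the clustering and PCA steps see only the features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Supervised: the labels y are used during training
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised: only X is used -- clustering and dimensionality reduction
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)
print(clusters[:3], X_2d.shape)
```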
