Questions and Answers
Which of the following file paths indicates a document named 'study_gude1.html'?
- file:///Users/Documents/study_guide.pdf
- file:///Root/System/Important/data.txt
- file:///Users/ash/Data_231/Untitled-1.html
- file:///Users/ash/Data_231/study_gude1.html (correct)
If a series of files are named 'Untitled-1.html', which attribute is most likely being sequentially updated to differentiate them?
- A page or version number (correct)
- The user ID
- The directory path
- The file extension
Based on the file paths, what can be inferred about the user 'ash'?
- They have a directory named 'Data_231' for organizing files. (correct)
- They do not use the 'Users' directory.
- They are exclusively working with system files.
- They primarily work with PDF documents.
What is the most likely reason for the multiple files named 'Untitled-1.html'?
Which of the following is the most relevant to understanding the context of the listed files?
If several files are located under /Users/ash/Data_231/, which statement is most likely true?
Given the series of HTML files 'Untitled-1.html', what does the '.html' suffix indicate?
If 'study_gude1.html' and 'Untitled-1.html' exist in the same directory, which is likely true?
Study Notes
Loss Function
- Measures the error between predicted values and actual values in a model
- The goal is to minimize error by optimizing model parameters
Likelihood and Probability
- Likelihood measures how well a specific set of parameters explains the observed data
- Probability measures how likely an outcome is to be observed, given fixed parameters
- Probability deals with data given fixed parameters
- Likelihood measures the "fit" of parameters given observed data
- For a set of data points X = {x1, x2,..., xn} and parameter θ, the likelihood is L(θ|X) = product from i=1 to n of P(xi|θ)
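A minimal Python sketch of the product form above, using hypothetical Bernoulli (coin-flip) data and a few candidate values of θ; the data and parameter values are made up for illustration:

```python
import numpy as np

# Hypothetical observed coin flips (1 = heads, 0 = tails)
X = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def bernoulli_likelihood(theta, data):
    """L(theta | X) = product over i of P(x_i | theta) under a Bernoulli model."""
    return np.prod(theta ** data * (1 - theta) ** (1 - data))

for theta in (0.3, 0.5, 0.75):
    print(f"theta={theta:.2f}  L(theta|X)={bernoulli_likelihood(theta, X):.6f}")
```

The value θ = 0.75 (the sample mean of this data) gives the largest likelihood, which previews the MLE idea in the next section.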
Maximum Likelihood Estimation (MLE)
- Estimates the parameter θ that maximizes the likelihood of observing the given data
- θMLE = arg max L(θ|X)
Negative Log-Likelihood (NLL)
- Transforms the product of probabilities (likelihood) into a sum for easier optimization
- Minimizes NLL instead of maximizing the likelihood
- NLL is commonly used as a loss function in classification models, especially in probabilistic models
- Minimizing the NLL is equivalent to maximizing the likelihood
- min(-log L(θ|X)) is equivalent to max L(θ|X)
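A small sketch (hypothetical Gaussian data, with the variance assumed known) showing that numerically minimizing the NLL recovers the same estimate as the closed-form MLE, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=50)   # hypothetical observations

def nll(mu, data, sigma=1.0):
    """Negative log-likelihood of the data under a Gaussian with known sigma."""
    return 0.5 * np.sum(((data - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

result = minimize_scalar(nll, args=(X,))
print("mu that minimizes the NLL:", result.x)      # ~ sample mean
print("closed-form MLE (sample mean):", X.mean())
```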
Maximum A Posteriori (MAP)
- Estimates the parameter θ that maximizes the posterior probability by incorporating prior knowledge
- Bayes' Rule defines the posterior as P(θ|X) = P(X|θ)P(θ) / P(X)
- Posterior = likelihood × prior / evidence
- Use the posterior when incorporating prior knowledge about a parameter
- To find the parameter using the posterior, solve θMAP = arg max P(θ|X)
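As an illustration of MAP versus MLE, here is a sketch for a Bernoulli parameter with a Beta(a, b) prior, for which the MAP estimate has a closed form; the counts and prior hyperparameters are hypothetical:

```python
# MAP vs. MLE for a Bernoulli parameter theta with a Beta(a, b) prior.
heads, tails = 3, 1          # hypothetical small sample: 4 coin flips
a, b = 2.0, 2.0              # Beta prior expressing mild belief in a fair coin

theta_mle = heads / (heads + tails)                          # arg max L(theta|X)
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)    # arg max P(theta|X)

print("MLE estimate:", theta_mle)   # 0.75 -- driven entirely by the data
print("MAP estimate:", theta_map)   # ~0.67 -- pulled toward the prior mean of 0.5
```

With only four observations the prior noticeably shifts the estimate, matching the point above that priors matter most for small samples.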
When is a prior useful
- Small sample size
- Real background knowledge existing outside of current dataset
- When the prior serves as a regularizer (e.g., Lasso and Ridge regularization)
Machine Learning Workflow
- Data is acquired
- A model with parameters is selected
- A loss function is optimized to fit the model to the data
Loss Functions
- Four types
- Negative Log-Likelihood (NLL)
- Used in probabilistic models or classification tasks
- Minimizes the difference between predicted and actual probability distributions
- Sum or Mean Absolute Error (MAE)
- Appropriate for regression tasks where deviations between actual and predicted values are penalized linearly rather than quadratically
- More robust to outliers than Mean Squared Error (MSE)
- Lasso (L1 Loss/Regularization)
- Promotes sparsity in the model by forcing some coefficients to zero
- Great for feature selection
- Ridge (L2 Loss/Regularization)
- Penalizes large model coefficients without forcing them to zero
- Prevents overfitting in regression models
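A short sketch of these losses as plain NumPy functions; the labels, predictions, and weights are hypothetical, and the penalty functions show only the regularization term that would be added to a base loss:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average |actual - predicted|."""
    return np.mean(np.abs(y_true - y_pred))

def nll_bernoulli(y_true, p_pred, eps=1e-12):
    """Negative log-likelihood for binary labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def l1_penalty(w, lam):
    """Lasso-style penalty: lam * sum |w|, pushes some coefficients to exactly zero."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """Ridge-style penalty: lam * sum w^2, shrinks coefficients without zeroing them."""
    return lam * np.sum(w ** 2)

# Hypothetical values just to show the calls
y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([0.5, -1.2, 0.0, 3.0])
print(mae(y_true, p_pred), nll_bernoulli(y_true, p_pred),
      l1_penalty(w, 0.1), l2_penalty(w, 0.1))
```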
Naive Bayes: Parameters, Features, and Labels
- Parameters are the underlying values the model learns, like mean and variance in Gaussian Naive Bayes
- Features are the input variables to use for prediction
- Discrete Labels are the possible categories the model predicts such as spam or not spam
Starting Point of Naive Bayes Classifiers
- Start using Bayes' Rule
- P(Y|X) = P(X|Y)P(Y) / P(X)
- It conditions on the predictor, meaning it calculates the probability of the class given known features
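A tiny worked example of conditioning on a predictor with Bayes' Rule; all of the probabilities below are hypothetical:

```python
# Hypothetical spam example: P(spam | message contains the word "offer")
p_spam = 0.4                 # prior P(Y = spam)
p_ham = 0.6                  # prior P(Y = ham)
p_offer_given_spam = 0.7     # likelihood P(X = "offer" | spam)
p_offer_given_ham = 0.1      # likelihood P(X = "offer" | ham)

# Bayes' Rule: P(Y|X) = P(X|Y) P(Y) / P(X), where P(X) sums over both classes
evidence = p_offer_given_spam * p_spam + p_offer_given_ham * p_ham
p_spam_given_offer = p_offer_given_spam * p_spam / evidence
print(f"P(spam | 'offer') = {p_spam_given_offer:.3f}")   # ~0.824
```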
Categorical Naive Bayes
- The main assumption is that features are conditionally independent given the class label
- This assumption simplifies learning because it computes probabilities independently, reducing computational complexity
- Steps for Categorical Naive Bayes:
- Calculate the base rates (priors): P(Y)
- Compute the probability of each class/predictor: P(X|Y)
- Divide the count of each feature value within a class by that class's total count to get the conditional probabilities P(X|Y)
- The sleep deprivation and symptoms (mild, moderate, severe) example illustrates these steps (a sketch with stand-in data follows below)
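A minimal pandas sketch of these steps. The table below is hypothetical stand-in data (the actual sleep-deprivation example is not reproduced here): the feature is symptom severity and the label is whether the person is sleep deprived:

```python
import pandas as pd

# Hypothetical categorical data: symptom severity vs. sleep-deprivation label
df = pd.DataFrame({
    "symptom":  ["mild", "severe", "moderate", "severe", "mild", "moderate"],
    "deprived": ["no",   "yes",    "yes",      "yes",    "no",   "no"],
})

# Step 1: base rates (priors) P(Y)
priors = df["deprived"].value_counts(normalize=True)

# Step 2: conditional probabilities P(X|Y) -- counts of each feature value
# within a class, divided by that class's total count
cond = pd.crosstab(df["deprived"], df["symptom"], normalize="index")

print(priors)
print(cond)
```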
Pros of Naive Bayes Classification
- No extensive training needed
- Relatively fast
- Works on both categorical and continuous data
- Insensitive to irrelevant data
Cons of Naive Bayes Classification
- A zero probability issue exists; a categorical value missing from the training data will cause the model to assign it zero probability
- It has a strong independence assumption, but in reality features are often correlated, which affects prediction accuracy
- Probabilities can be misleading because the actual values of computed probabilities are often incorrect
When to Use Different Types of Naive Bayes?
- Categorical Naive Bayes (CategoricalNB) is used when features are categorical (e.g., presence/absence of symptoms)
- Gaussian Naive Bayes (GaussianNB) is used when features are continuous and follow a normal distribution (e.g., height, weight, temperature)
- Bernoulli Naive Bayes (BernoulliNB) is used when features are binary (e.g., 0/1 values in text classification)
- Multinomial Naive Bayes (MultinomialNB) is used when dealing with count-based data, such as word frequencies in documents (e.g., spam detection)
- Complement Naive Bayes (ComplementNB) is used when class imbalances exist, meaning one class has significantly more examples than another (e.g., rare disease detection)
- Optimal Naive Bayes (OptimalNB) is used when an optimized form of Naive Bayes is needed, often tuned for specific datasets
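A scikit-learn sketch of two of these variants on synthetic data (continuous features for GaussianNB, count features for MultinomialNB); note that scikit-learn's `sklearn.naive_bayes` module provides the first five classes listed above, while "OptimalNB" is not a standard scikit-learn estimator:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)

# Continuous features (e.g., height, weight) -> GaussianNB
X_cont = rng.normal(size=(100, 2)) + np.repeat([[0.0, 0.0], [2.0, 2.0]], 50, axis=0)
print("GaussianNB accuracy:", GaussianNB().fit(X_cont, y).score(X_cont, y))

# Count-based features (e.g., word frequencies) -> MultinomialNB
X_counts = rng.poisson(lam=np.where(y[:, None] == 1, 3.0, 1.0), size=(100, 5))
print("MultinomialNB accuracy:", MultinomialNB().fit(X_counts, y).score(X_counts, y))
```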
Gaussian Naive Bayes (GNB) Assumption
- Rather than categorical probabilities, assumes feature values follow a normal (Gaussian) distribution: P(Xj|Y) = N(Xj | µY, σ^2Y), where Xj is the feature, µY is the mean, and σ^2Y is the variance of the feature within class Y
Numerical Issues: Underflow & Zero Probability
- Underflow occurs when multiplying many small probabilities together; the result can be so small that it rounds to zero due to floating-point limitations
- Use the logarithm trick: Instead of computing the product of probabilities, sum their logarithms: log P(Y|X) = log P(Y) + ∑ log P(Xj|Y)
- The spam-ham example uses Multinomial Naive Bayes to model word counts
- The zero probability problem occurs when a word never appears in a given class in the training dataset, Naive Bayes assigns it zero probability, eliminating that class from consideration
- If words missing from a class are simply ignored and only the words that do appear in the spam-ham example are used, the model might be overconfident in its predictions and fail when encountering new words
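A quick numerical demonstration of the underflow problem and the logarithm trick; the per-word probabilities are hypothetical:

```python
import numpy as np

# 2,000 hypothetical per-word probabilities, each small
probs = np.full(2000, 1e-3)

product = np.prod(probs)          # underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(probs))   # stays finite and can still be compared across classes

print("direct product:", product)   # 0.0
print("sum of logs:   ", log_sum)   # about -13815.5
```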
Solutions to the Zero Probability Problem
- Apply Additive Smoothing (Laplace Smoothing)
- Add a small positive number α to all counts: P(Xj|Y) = (count(Xj, Y) + α) / Σk (count(Xk, Y) + α), where the sum runs over all feature values k for class Y
- This method prevents zero probabilities without significantly altering large counts
- Adjust Smoothing Parameter α
- Choosing α carefully (commonly α = 1) balances between handling zero probabilities and keeping real probability distributions intact
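A minimal sketch of additive smoothing with hypothetical word counts for a single class; note how the zero count becomes a small but nonzero probability:

```python
import numpy as np

# Hypothetical counts for the words ["offer", "free", "prize", "win"] in one class;
# "prize" never appeared in the training data for this class.
counts = np.array([12, 7, 0, 3])
alpha = 1.0   # common default for Laplace smoothing

unsmoothed = counts / counts.sum()                     # contains an exact zero
smoothed = (counts + alpha) / (counts + alpha).sum()   # every probability is > 0

print("without smoothing:", unsmoothed)
print("with smoothing:   ", smoothed)
```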
K-Nearest Neighbors (KNN)
- KNN is a non-parametric, instance-based learning algorithm
- KNN classifies a new data point based on the majority class of its K nearest neighbors
- KNN is a non-parametric algorithm, which means it does not assume a specific functional form for the data
- KNN memorizes the training data and makes decisions based on similarity
Increasing/Decreasing K in KNN
- Increasing K reduces variance and smooths decision boundaries, but can lead to underfitting
- Decreasing K increases variance and makes the model sensitive to noise, but can lead to overfitting
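A scikit-learn sketch (synthetic data) showing how the choice of K moves the model between memorization and smoother decision boundaries:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:>2}  train acc={knn.score(X_tr, y_tr):.2f}  "
          f"test acc={knn.score(X_te, y_te):.2f}")
```

K = 1 typically gives perfect training accuracy (memorization) with weaker test accuracy, while larger K smooths the boundary at the cost of some flexibility.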
Overfitting
- Occurs when a model learns noise rather than patterns in training data
- Results in good performance on the training set but poor performance on unseen data
- The model memorizes noise rather than generalizing the underlying data distribution
Underfitting
- Occurs when a model is too simple to capture the pattern in the data
- Leads to high bias and poor performance on both training and test data
Test Error
- Error rate on unseen data
- Used to measure a model's generalization ability
Overfitting vs. Underfitting (Train vs. Test Error)
- Overfitting: Low training error and high testing error
- Underfitting: High training error and high testing error
Cross-Validation (CV)
- A technique used to evaluate model performance by splitting data into multiple subsets and training/testing on different parts of the dataset
- Helps prevent overfitting, ensures the model generalizes well to unseen data, and helps select hyperparameters
Train-Test Split
- Splitting data into training and test sets ensures model evaluation is done on unseen data
- Provides a better measure of generalization
Leave-One-Out Cross-Validation (LOO-CV)
- A special case of k-fold cross-validation where each data point is used as a test set exactly once
- Useful for small datasets, but computationally expensive
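A scikit-learn sketch of k-fold and leave-one-out cross-validation on a small built-in dataset; the model choice here is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the data is split into 5 parts, each used once as the test fold
print("5-fold CV mean accuracy:", cross_val_score(model, X, y, cv=5).mean())

# LOO-CV: each of the 150 samples is the test set exactly once (150 model fits)
print("LOO-CV mean accuracy:  ", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```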
Regularization
- Prevents overfitting
- Achieved by adding a penalty to large weights in the model
- Controls complexity
L1 vs. L2 Regularization
- L1 (Lasso) Regularization encourages sparse features by setting some coefficients to zero for feature selection
- Suitable when many features are irrelevant
- L2 (Ridge) Regularization shrinks weights smoothly with no zero coefficients
- Suitable when all features contribute, but need smaller magnitudes
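A short scikit-learn comparison of the two penalties on synthetic data in which only a few features are informative; the regularization strength alpha = 1.0 is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))  # usually several
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))  # typically none
```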
Bias-Variance Tradeoff
- Bias is the error due to simplified assumptions (e.g., underfitting)
- Variance is the error due to sensitivity to noise (e.g., overfitting)
- The tradeoff is that increasing model complexity lowers bias but increases variance
- The goal is to find an optimal balance
Gradient Descent
- The gradient is the rate of change of a function with respect to its parameters
- Mathematically, the gradient of a function f(x) is ∇f(x) = (∂f/∂x1, ..., ∂f/∂xn)
- On a contour plot, the gradient points in the direction of steepest ascent, i.e., toward increasing function values
- Because the gradient points toward steepest ascent, minimization requires stepping in the opposite direction
Gradient Descent Algorithm
- Basic steps:
- Initialize weights randomly.
- Compute the gradient of the loss function.
- Update parameters in the opposite direction of the gradient.
- Repeat until convergence.
- Equation: θ(t+1) = θ(t) – α∇J(θ)
- where:
- α = learning rate
- ∇J(θ) = gradient of the loss function
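A minimal sketch of the update rule above on a one-dimensional quadratic loss, where the minimum is known to be at θ = 3:

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
def grad(theta):
    return 2 * (theta - 3)

theta = 10.0     # arbitrary initial value
alpha = 0.1      # learning rate
for step in range(100):
    theta = theta - alpha * grad(theta)   # theta(t+1) = theta(t) - alpha * grad J(theta)

print("theta after gradient descent:", theta)   # very close to 3
```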
Learning Rate
- Controls how much weights are updated at each step
- If too small, convergence is slow
- If too large, the model might overshoot and never converge
When is Gradient Descent Valid
- The loss function must be differentiable and gradient updates must move the function towards a minimum
- To choose the learning rate:
- trial and error, using cross-validation (CV)
- adaptive methods (e.g., Adam, RMSProp)
Step Size
- This term determines how far to move in the direction of the gradient
- If too large, the model may miss the optimal point
- If too small, convergence will be slow
Gradient of the Loss
- The gradient tells how much, and in which direction, to adjust model parameters to minimize the error
- Mathematically, if the loss function is J(θ), the gradient is: ∇J(θ) = ∂J/∂θ
- Parameter update equation during gradient descent: θ(t+1) = θ(t) – α∇J(θ)
Step Size
- Controls how much parameters are adjusted in the direction of the gradient
- A small step size yields slow convergence but stable learning
- A large step size yields faster learning but the model may overshoot and not converge
- Adaptive step sizes (e.g., Adam optimizer) adjust automatically
Supervised vs. Unsupervised Learning
- Supervised learning occurs when a model learns from labeled data, meaning each input has a corresponding known output (label).
- Examples: classification (spam detection, image recognition) and regression (predicting house prices)
- Unsupervised learning occurs when a model learns patterns from unlabeled data (i.e., no explicit output labels).
- Examples: clustering (grouping customers based on purchase behavior) and dimensionality reduction (PCA for feature extraction)
Applying Supervised vs. Unsupervised Learning
- Supervised learning applies when labeled data and corresponding predictions are desired
- Unsupervised learning applies when unlabeled data exists and structure needs to be determined
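A side-by-side scikit-learn sketch on synthetic data: the supervised model uses the labels, while the unsupervised model ignores them and looks for structure:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: labels y are available, so fit a classifier to predict them
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("classifier accuracy on labeled data:", clf.score(X, y))

# Unsupervised: discard y and let the algorithm discover clusters on its own
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments for the first 10 points:", clusters[:10])
```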
Description
This lesson explores HTML file paths, naming conventions, and common scenarios like the creation of multiple 'Untitled' files. It covers file differentiation using sequential updates and inferences about user file organization. Understanding file extensions and directory structures is key.