Machine Learning Notes PDF

Summary

These notes provide a concise overview of core Machine Learning concepts, including loss functions, Naive Bayes classifiers, logistic regression, gradient descent, and evaluation metrics. They are designed for students and enthusiasts learning Machine Learning.

Full Transcript

Loss Functions & Likelihood

What is a loss function?
A function that quantifies how far a model's predictions are from the actual values.

What is the goal of a loss function?
To minimize the error between predicted and actual values.

What is likelihood, and how is it different from probability?
- Likelihood: measures how well a particular set of parameters explains observed data.
- Probability: measures the chance of an event occurring before the data are seen.

What is the equation for likelihood?
For independent data points x_1, x_2, \ldots, x_n, the likelihood function is

L(\theta) = P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)

What is the purpose of maximum likelihood estimation (MLE)?
To estimate the parameter \theta that maximizes the likelihood function.

What is the purpose of the negative log-likelihood (NLL), and how is it related to MLE?
The NLL simplifies optimization by converting products into sums:

-\log L(\theta) = -\sum_{i=1}^{n} \log P(x_i \mid \theta)

Because the logarithm is monotonic, minimizing the NLL is equivalent to maximizing the likelihood, and the NLL is commonly used as a loss function in machine learning.
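As a concrete illustration (not part of the original notes), here is a minimal Python sketch of MLE via the NLL, assuming i.i.d. Bernoulli(\theta) coin flips; the data array and the grid search are hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails
x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

def nll(theta, x):
    """Negative log-likelihood of i.i.d. Bernoulli(theta) data:
    -sum_i log P(x_i | theta)."""
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# The MLE minimizes the NLL; a simple grid search suffices here.
grid = np.linspace(0.01, 0.99, 99)
theta_mle = grid[np.argmin([nll(t, x) for t in grid])]
print(theta_mle)  # ~0.7, matching the closed-form Bernoulli MLE: x.mean()
```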
What is the purpose of Maximum a Posteriori (MAP) estimation?
To estimate parameters while incorporating prior knowledge, using Bayes' Rule.

How is the posterior related to Bayes' Rule?
Bayes' Rule defines the posterior:

P(\theta \mid X) = \frac{P(X \mid \theta) \, P(\theta)}{P(X)}

When do we want to use the posterior?
When we have prior knowledge about a parameter.

How do we find the parameter using the posterior?
By maximizing the posterior:

\theta^* = \arg\max_{\theta} P(\theta \mid X)

When is a prior useful?
1. When the sample size is small.
2. When we have real background knowledge about the parameter.
3. When it can act as a regularizer to prevent overfitting.
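Continuing the hypothetical coin-flip sketch above, MAP estimation can be illustrated by adding a log-prior to the NLL and minimizing the result; the Beta(a, b) prior and its parameters a = b = 5 are assumptions made for this example.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # same hypothetical coin flips
a, b = 5.0, 5.0  # hypothetical Beta(a, b) prior, centered at 0.5

def neg_log_posterior(theta, x):
    # MAP minimizes the NLL minus the log-prior (up to an additive constant)
    nll = -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    return nll - log_prior

grid = np.linspace(0.01, 0.99, 99)
theta_map = grid[np.argmin([neg_log_posterior(t, x) for t in grid])]
print(theta_map)  # ~0.61: pulled from the MLE (0.7) toward the prior mean (0.5)
```

This also shows point 3 above in action: the prior acts as a regularizer, shrinking the estimate toward 0.5.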
Machine Learning (ML) Workflow

What is the general workflow of ML systems?
1. Collect data.
2. Define a model with parameters.
3. Optimize a loss function to train the model.

What are four different loss functions, and when should each be used?
1. Negative log-likelihood (NLL): probabilistic models.
2. Sum or mean absolute error (MAE): robust to outliers.
3. Lasso (L1 regularization): feature selection, sparse models.
4. Ridge (L2 regularization): reducing model complexity.

Naive Bayes & Gaussians

What are parameters, features, and discrete labels?
- Parameters: values the model learns (e.g., the probabilities in Naive Bayes).
- Features: the input variables.
- Discrete labels: the output categories.

How do Naive Bayes classifiers begin?
They apply Bayes' Rule and condition on the predictors.

What is the main assumption of categorical Naive Bayes?
That the features are conditionally independent given the class.

Why does this assumption help?
It simplifies the model and makes learning easier.

What are the steps for categorical Naive Bayes?
1. Compute the base rates (prior probabilities) for each class.
2. Count the occurrences of each feature value within each class.
3. Divide the counts by the class totals to get the conditional probabilities.

What example was used in class for CategoricalNB?
A sleep-deprivation and symptoms example.

What are the pros of Naive Bayes classification?
- No need for extensive training.
- Works with categorical and continuous data.
- Fast and efficient.
- Robust to irrelevant features.

What are the cons of Naive Bayes classification?
- Zero-probability problem: if a feature value never appears in the training data, it is assigned zero probability.
- Strong independence assumption, which is rarely true in practice.
- Probability estimates are unreliable, though the relative rankings are useful.
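The notes refer to scikit-learn's Naive Bayes variants by name; below is a minimal CategoricalNB sketch on a hypothetical, ordinally encoded dataset (the data and category codes are invented for illustration).

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical encoded dataset: each column is a categorical feature
# (e.g., 0/1/2 codes for symptom severity); y is the class label.
X = np.array([[0, 1], [1, 1], [2, 0], [0, 0], [2, 1], [1, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

clf = CategoricalNB(alpha=1.0)  # alpha > 0 applies Laplace smoothing,
clf.fit(X, y)                   # which avoids the zero-probability problem

print(clf.predict([[2, 1]]))        # predicted class
print(clf.predict_proba([[2, 1]]))  # estimated class probabilities
```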
When should different Naive Bayes classifiers be used?
- CategoricalNB: discrete categorical features.
- GaussianNB: continuous features that follow a normal distribution.
- BernoulliNB: binary features.
- MultinomialNB: count-based features (e.g., text data).
- ComplementNB: imbalanced class distributions.

What new assumption do we make for Gaussian Naive Bayes?
Each feature follows a normal distribution within each class:

P(x_j \mid y) = \mathcal{N}(\mu_y, \sigma_y^2)

What is the underflow problem, and how can we solve it?
- Problem: multiplying many small probabilities leads to numerical underflow.
- Solution: work with log probabilities and sum them instead of multiplying.

Classification Metrics

What is a confusion matrix for binary classification?
A table showing correct and incorrect predictions:

Actual \ Predicted   Positive   Negative
Positive             TP         FN
Negative             FP         TN

What are Type I and Type II errors?
- Type I (false positive, FP): predicting positive when the actual value is negative.
- Type II (false negative, FN): predicting negative when the actual value is positive.

What is accuracy, and what is its equation?

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Drawback: it can be misleading on imbalanced data.

What is precision, and what is its equation?

Precision = \frac{TP}{TP + FP}

Higher precision means fewer false positives.

What is recall, and what is its equation?

Recall = \frac{TP}{TP + FN}

Higher recall means fewer false negatives.

What is the precision-recall tradeoff?
Improving one often reduces the other.

What is the F1-score, and what is its equation?
The harmonic mean of precision and recall:

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

A low F1-score indicates poor model performance.

What is the ROC curve?
A plot of the true positive rate (TPR) against the false positive rate (FPR) across classification thresholds. (These metrics are shown in code in the scikit-learn sketch at the end of these notes.)

Logistic Regression

What is logistic regression?
A model that predicts probabilities by passing a linear combination of the features through the sigmoid function \sigma(z) = \frac{1}{1 + e^{-z}}.

What is the loss function for logistic regression?
The cross-entropy loss, derived from MLE.

What is cross-entropy loss?

-\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

Gradient Descent

What is the gradient, conceptually and mathematically?
Conceptually, the direction of steepest ascent of a function. Mathematically:

\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)

What is the gradient descent algorithm?
Iteratively update the parameters:

\theta \leftarrow \theta - \alpha \nabla L(\theta)

where \alpha is the learning rate.

How do we choose the learning rate?
- Trial and error.
- Cross-validation.

What happens if the learning rate is too high or too low?
- Too high: the updates may diverge.
- Too low: convergence is slow.
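Tying the last two sections together, here is a minimal sketch (not from the original notes) of gradient descent on the cross-entropy loss of a one-feature logistic regression model; the toy data, learning rate, and iteration count are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 1-D toy data: the label is 1 when the feature is large
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.0, 0.0  # parameters theta = (w, b)
alpha = 0.1      # learning rate (chosen here by trial and error)

for _ in range(5000):
    y_hat = sigmoid(X[:, 0] * w + b)
    # Gradient of the mean cross-entropy loss with respect to w and b
    grad_w = np.mean((y_hat - y) * X[:, 0])
    grad_b = np.mean(y_hat - y)
    # Gradient descent update: theta <- theta - alpha * gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

print(sigmoid(np.array([1.0, 4.0]) * w + b))  # ~[low, high] probabilities
```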

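Finally, the scikit-learn sketch of the classification metrics referenced above, run on hypothetical labels and predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Note: confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels,
# i.e., row order is the reverse of the table in these notes.
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```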