# Machine Learning Cheat Sheet

## Supervised Learning

Given a set of data points $x$ with corresponding labels $y$, learn a function that predicts the labels of new data points.

### Notation

- $x$: input data
- $y$: label
- $m$: number of training examples
- $n$: number of features
- $h$: hypothesis (our prediction)

### Linear Regression

- Predicts a continuous value.
- Hypothesis: $h(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$
- Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2$
- Gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$
- Normal equation: $\theta = (X^T X)^{-1} X^T y$
- A NumPy gradient-descent sketch appears at the end of this Supervised Learning section.

### Logistic Regression

- Predicts a binary value.
- Hypothesis: $h(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ (sigmoid function)
- Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]$
- Gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$

### Regularization

- Add a penalty term to the cost function to prevent overfitting.
- Cost function: $J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$
- Gradient descent: $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$ for $j > 0$
- Normal equation: $\theta = (X^T X + \lambda A)^{-1} X^T y$, where $A$ is the identity matrix with its first diagonal entry set to 0 (so the bias term $\theta_0$ is not regularized).

### Neural Networks

- A series of interconnected nodes (neurons) organized in layers.
- Each connection has an associated weight.
- Activation functions introduce non-linearity.
- Forward propagation: compute the output of the network.
- Backpropagation: compute the gradient of the cost function with respect to the weights.
- Gradient descent: update the weights.
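As a minimal sketch of the gradient-descent and regularization updates above, here is a NumPy implementation of regularized linear regression. The function name `gradient_descent`, the synthetic data, and the default hyperparameters are illustrative assumptions, not part of the cheat sheet.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, lam=0.0, iterations=1000):
    """Batch gradient descent for (optionally regularized) linear regression.

    X: (m, n) feature matrix without the bias column; y: (m,) targets.
    Returns the learned parameter vector theta of length n + 1.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s for theta_0
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        error = Xb @ theta - y              # h(x^(i)) - y^(i) for every example
        grad = (Xb.T @ error) / m           # (1/m) * sum of error * x_j
        grad[1:] += (lam / m) * theta[1:]   # regularize every theta_j except theta_0
        theta -= alpha * grad
    return theta

# Tiny usage example on synthetic data: y is roughly 4 + 3x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=100)
print(gradient_descent(X, y, alpha=0.1, lam=0.1, iterations=5000))
```

The same update applies to logistic regression once the hypothesis is passed through the sigmoid; only the definition of $h(x)$ changes.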
## Unsupervised Learning

Given a set of data points $x$, learn the underlying structure of the data.

### K-Means Clustering

- Partition the data into $K$ clusters.
- Randomly initialize $K$ cluster centroids.
- Assign each data point to the closest centroid.
- Recompute each centroid as the mean of the data points assigned to it.
- Repeat until convergence.
- Cost function: $J(c, \mu) = \frac{1}{m} \sum_{i=1}^m \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$

### Principal Component Analysis (PCA)

- Reduce the dimensionality of the data while preserving the most important information.
- Compute the covariance matrix of the data.
- Compute the eigenvectors of the covariance matrix.
- Select the $k$ eigenvectors corresponding to the largest eigenvalues.
- Project the data onto the subspace spanned by the selected eigenvectors.

## Anomaly Detection

Identify data points that differ significantly from the rest of the data.

### Gaussian Distribution

- Model each feature of the data with a Gaussian distribution.
- Compute the mean and variance of each feature.
- Estimate the probability of a new data point.
- If the probability falls below a chosen threshold, classify the data point as an anomaly.
- $p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$

## Evaluation

Assess the performance of a machine learning model.

### Metrics

- Accuracy: $\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
- Precision: $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
- Recall: $\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
- F1-score: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2}$
- R-squared: $1 - \frac{\sum_{i=1}^m (y^{(i)} - h(x^{(i)}))^2}{\sum_{i=1}^m (y^{(i)} - \bar{y})^2}$

### Bias vs. Variance

- Bias: the difference between the average prediction of the model and the correct value.
- Variance: the variability of the model's prediction for a given data point.
- High bias: underfitting.
- High variance: overfitting.

### Train/Validation/Test Sets

- Divide the data into three sets: training, validation, and test.
- Train the model on the training set.
- Evaluate the model on the validation set to tune hyperparameters.
- Evaluate the final model on the test set to estimate its generalization performance.

## Tips and Tricks

### Feature Scaling

- Scale the features to have similar ranges.
- Prevents features with large ranges from dominating the learning process.
- Common methods:
  - Standardization: $x_i := \frac{x_i - \mu}{\sigma}$
  - Min-max scaling: $x_i := \frac{x_i - \min(x)}{\max(x) - \min(x)}$

### Learning Rate

- Choose an appropriate learning rate for gradient descent.
- Too small: slow convergence.
- Too large: may overshoot the minimum.
- Learning rate decay can be used to reduce the learning rate over time.

### Regularization

- Choose an appropriate regularization parameter $\lambda$.
- Too small: overfitting.
- Too large: underfitting.
- Cross-validation (or a held-out validation set) can be used to choose the best value of $\lambda$, as in the sketch below.
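To illustrate the last point, here is a minimal sketch that fits the regularized normal equation from the Regularization section for several values of $\lambda$ and keeps the one with the lowest validation RMSE. The synthetic data, the train/validation/test split sizes, and the candidate $\lambda$ grid are arbitrary assumptions for the example, and a single hold-out validation set stands in for full k-fold cross-validation.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Closed-form regularized linear regression: theta = (X^T X + lam*A)^{-1} X^T y,
    where A is the identity with its first diagonal entry zeroed (theta_0 unregularized)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])
    A = np.eye(n + 1)
    A[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + lam * A, Xb.T @ y)

def rmse(theta, X, y):
    """Root mean squared error of the linear hypothesis h(x) = theta^T x."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.sqrt(np.mean((Xb @ theta - y) ** 2))

# Synthetic data, split into train / validation / test as described above.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)
X_train, y_train = X[:120], y[:120]
X_val, y_val = X[120:160], y[120:160]
X_test, y_test = X[160:], y[160:]

# Pick the lambda with the lowest validation RMSE, then report test RMSE once.
best_lam = min([0.0, 0.01, 0.1, 1.0, 10.0],
               key=lambda lam: rmse(ridge_normal_equation(X_train, y_train, lam),
                                    X_val, y_val))
theta = ridge_normal_equation(X_train, y_train, best_lam)
print(best_lam, rmse(theta, X_test, y_test))
```

Touching the test set only once, after $\lambda$ has been chosen on the validation set, keeps the reported test RMSE an honest estimate of generalization performance.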