BUSA8001 Cheat Sheet PDF
Summary
This document introduces concepts in machine learning and predictive analytics, including supervised and unsupervised learning, classification, regression, and clustering. It then works through the algorithms, preprocessing, model evaluation, and tuning techniques covered in each week of the unit.
Full Transcript
BUSA8001 Cheat Sheet
====================

Week 1 Introduction
-------------------

1. **What is meant by features, target, examples, training, and labelled datasets in the context of machine learning and predictive analytics?**
   - Features (X): also referred to as explanatory variables, predictors, inputs, or independent variables; a column in the data matrix X.
   - Target (Y): also referred to as the label, output, dependent variable, or response variable.
   - An example is an observation from a sample.
   - Training is model fitting (e.g. parameter estimation in a linear regression).
   - Labelled datasets are datasets containing data on the target/label (Y).

2. **Why do we use vector and matrix notation in predictive analytics?**
   - It makes the code run faster, since many computer languages, including Python, are optimised to deal with arrays (vectors and matrices).
   - Many mathematical formulas are written using matrix notation, and writing them out term by term would be impractical.

3. **What is supervised learning? What is unsupervised learning? What is the difference between the two?**
   -
   -
   -

4. **What is classification? What is regression? What is the difference between the two?**
   -
   -
   -

5. **What is meant by clustering?**
   - Clustering is a type of unsupervised learning where we attempt to group a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters), without having any prior knowledge of their group memberships.

Week 2 Classification Algorithms P1
-----------------------------------

1. **What is the classification problem? How is binary classification different from multiclass classification?**
   - Binary classification is a method used to classify the **class labels into 2 categories** only. In contrast, multiclass classification can classify the data into more than two classes.
   - Multiclass classification into *n* classes can be done by building *n* binary classification algorithms, where each algorithm performs a one-vs-rest classification task.

2. **What is NumPy? Why do we use it instead of Pandas?**
   - NumPy is a library in Python that supports large, multidimensional arrays or matrices, with many high-level mathematical functions to operate on the arrays.
   - We use NumPy when we have to deal with **large multidimensional arrays** in a faster way. Meanwhile, Pandas is useful for cleaning the data and the initial exploratory data analysis (EDA).

3. **What is the perceptron? What is the bias unit? What is a unit step function?**
   -
   -
   -

4. **What is a decision boundary?**
   - It is the line that separates 2 classes and classifies new data into those 2 classes based on their features and the trained model.

5. **What is Adaline? How is it different from the perceptron?**
   - Adaline is a modification of the perceptron algorithm.
   - The main difference is how prediction errors are computed, and hence how the algorithms are optimised.
   - In the perceptron, the weights are updated by computing the errors from the unit step function output compared to the true class labels.
   - In Adaline, the weights are updated using the errors computed from the output of the linear activation function and the true class labels.
   - The Adaline algorithm converges to the weight values that correspond to the minimum SSE even when the classes are not linearly separable, while the perceptron will keep updating forever.

6. **What are hyperparameters? Provide examples.**
   - They are parameters which are set by the analyst and not optimised from the data, e.g. the learning rate, the number of epochs, and the initial weights.
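To make the perceptron/Adaline contrast in Q5 concrete, here is a minimal NumPy sketch of a single-example weight update for each algorithm. This is not lecture code; the variable names (`eta`, `w`, `b`, `xi`, `yi`) and the one-example (SGD-style) update are illustrative assumptions.

```python
import numpy as np

eta = 0.01                              # learning rate (a hyperparameter, Q6)
w, b = np.zeros(2), 0.0                 # weights and bias unit
xi, yi = np.array([1.0, 2.0]), 1        # one training example, label in {1, -1}

# Perceptron: the error uses the unit step function's output (predicted class)
z = np.dot(xi, w) + b
y_hat = np.where(z >= 0.0, 1, -1)       # unit step activation
update = eta * (yi - y_hat)             # non-zero only when the example is misclassified
w += update * xi
b += update

# Adaline: the error uses the linear activation output z itself, not the class label
z = np.dot(xi, w) + b                   # linear activation output
error = yi - z                          # continuous-valued error
w += eta * error * xi                   # gradient step on the SSE cost
b += eta * error
```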
7. **What is meant by Gradient Descent? What does it do? How does it work? What is Stochastic Gradient Descent? How is it different from Gradient Descent? Why use it?**
   - Gradient Descent is an optimisation algorithm which uses the gradient of the cost function (e.g. the sum of squared errors in Adaline) to find the optimal weights. Optimal weights are the weights that minimise the cost function, which is a function of the forecast errors.
   - SGD updates the weights for each observation instead of after computing the sum of all forecast errors (squared errors). It can be faster than GD.

Week 3 Classification Algorithms P2
-----------------------------------

1. **Is there a classification algorithm that is able to produce the best forecasts in all classification problems?**
   - As noted by the computer scientist D. Wolpert, "No single classifier works best across all possible scenarios."

2. **What is scikit-learn?**
   - It is a Python module that contains machine learning libraries and sample data.
   - See here for more information: [**https://scikit-learn.org/stable/**](https://scikit-learn.org/stable/)

3. **What is meant by Feature Scaling? Why do we not scale the target as well?**
   -
   -
   1.
   2. The target is categorical (e.g., 0 or 1), and its values are already in a sufficiently small range.

4. **Why do we split our data into training and test datasets?**
   - To be able to test how the model performs on a new dataset, i.e. whether it is overfitted or underfitted.
   - A model which is trained on a (training) dataset is expected to perform well on that same dataset, so we need another dataset (test) to validate forecast accuracy.

5. **What is a logistic regression? What is the extra information that logistic regression provides in comparison to other models we have studied so far? What are log-odds? What is the sigmoid function, and what is it used for?**
   - Logistic regression is a classification algorithm.
   - With it we can get conditional probabilities of belonging to a class.
   - Log-odds = the logit function of the odds (see lecture notes).
   - The sigmoid function is the inverse of the logit function; it maps the log-odds back to a probability (see lecture notes).

6. **How do we estimate a logistic regression? What is the likelihood function? How do we get the cost function associated with a logistic regression?**
   - We estimate a logistic regression using Maximum Likelihood Estimation (MLE): estimating the parameters of a probability distribution by maximising a likelihood function, so that the observed data is most probable.
   - Likelihood function = the joint probability distribution of the sample, under the assumption that the errors are independent.
   - The cost function is formed by taking the negative of the log-likelihood (see lecture notes).

7. **What are the technical differences between Perceptron, Adaline and a logistic regression?**
   - The z's are the same -> linear functions of the x's.
   - The difference between the Perceptron and Adaline is in how the models are trained:
     - Perceptron: uses a step activation function to compute the errors.
     - Adaline: uses a linear activation function to compute the errors.
   - Logistic regression uses z to model the log-odds and trains the weights via Maximum Likelihood.
   - See lecture notes for a more detailed comparison.
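The Week 3 questions above (scaling, train/test splitting, logistic regression and its class probabilities) can be tied together in a short scikit-learn sketch. The synthetic dataset and the parameter values below are illustrative assumptions, not from the lecture notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data (a stand-in for the lecture datasets)
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# Hold out a test set to check for over/underfitting (Q4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Scale the features but not the categorical target (Q3)
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

# Fit by maximum likelihood and inspect conditional class probabilities (Q5, Q6)
clf = LogisticRegression().fit(X_train_std, y_train)
print(clf.score(X_test_std, y_test))        # accuracy on unseen data
print(clf.predict_proba(X_test_std[:3]))    # probabilities of belonging to each class
```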
8. **What is overfitting? What is underfitting? What is the Bias-Variance Tradeoff?**
   - Overfitting occurs when a model learns perceived but false patterns and noise in the training data, to the extent that it performs poorly on new data.
   - Underfitting happens when a model cannot capture the underlying trend of the data, usually due to being too simple, resulting in poor performance on both training and new data.
   - The Bias-Variance Tradeoff is a fundamental principle describing the tradeoff between the model's complexity (variance) and its accuracy in capturing underlying trends (bias); ideally, one seeks an appropriate balance between bias and variance to achieve predictions that are as accurate as possible.

9. **What is regularization? How is it done? Why is it done?**
   - Regularization is a technique used to prevent overfitting by adding a penalty on larger magnitudes of the model parameters.
   - It is typically done by adding a regularization term to the loss function, such as L1 or L2 regularization (see lecture notes for equations).
   - The primary reason for regularization is to introduce and control the trade-off between bias and variance, leading to more generalized models that perform better on unseen data.

Week 4 Classification Algorithms
--------------------------------

1. **How is the Support Vector Machine (SVM) algorithm different from other classification algorithms?**
   - SVM is the only algorithm that aims to maximise the margin between the decision boundary that separates the classes and the closest data points from each class, and thus obtain the optimal weights.

2. **What is the margin in the context of SVM?**
   - The margin is the normalised distance between the positive and negative hyperplanes, which are determined by the support vectors (the data points which are closest to the decision boundary). See lecture notes for the margin equation.

3. **What are slack variables, and when are they used?**
   - Slack variables are used in soft-margin classification.
   - They allow the algorithm to converge (we can find optimal w) even when dealing with non-linearly separable data.

4. **What is meant by kernel methods? When are they used?**
   - They deal with linearly inseparable data.
   - They compute a nonlinear function of the original features to provide additional dimensions in the feature space.
   - The data becomes linearly separable in the higher-dimensional space.
   - This allows us to separate the two classes via a **linear** hyperplane, which becomes a nonlinear decision boundary when we project it back to the original feature space.
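As a rough illustration of soft margins (Q3) and kernel methods (Q4), the sketch below fits scikit-learn's SVC on a toy non-linearly-separable dataset with a linear and an RBF kernel. The dataset, `C`, and `gamma` values are arbitrary choices for illustration, and the printed scores are training accuracies only.

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=200, noise=0.2, random_state=1)
X_std = StandardScaler().fit_transform(X)

# Linear SVM: soft margin controlled by C (slack variables), but still a straight boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X_std, y)

# RBF-kernel SVM: implicitly maps the features into a higher-dimensional space
# where a linear hyperplane can separate the classes
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_std, y)

print("linear kernel accuracy:", linear_svm.score(X_std, y))
print("rbf kernel accuracy:   ", rbf_svm.score(X_std, y))
```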
5. **What are decision trees?**
   - Decision trees are a type of supervised learning algorithm which models decisions and their possible consequences as a tree-like structure.
   - Each node in the tree represents a feature, each branch represents a decision rule, and each leaf node represents the outcome.
   - The paths from root to leaf represent classification rules or regression paths based on feature values.

6. **Can decision trees separate linearly inseparable classes?**
   - Decision trees can, unlike linear models, fit linearly inseparable datasets.
   - Decision trees do not compute a linear decision boundary.

7. **How are decision trees built?**
   1. **Choose the Best Feature to Split On**: Evaluate all possible features and determine the best one to split the data on. The "best" feature is the one that most effectively separates the data into groups that are homogenous in terms of the target variable. This effectiveness is measured by information gain (IG).
   2. **Split the Data**: Once the best feature is identified, divide the dataset into two or more subsets based on that feature's values. This can result in two child nodes in the case of binary decision trees, or more for decision trees that allow multiple branches from a single node.
   3. **Recursively Repeat for Each Child Node**: Apply steps 1 and 2 to each child node, choosing the best feature for further splits and partitioning the data accordingly. This recursive process continues for each branch of the tree.
   4. **Stopping Criteria**: To prevent the tree from growing indefinitely (which can lead to overfitting), define stopping criteria. Common criteria include:
      - The tree has reached a predetermined maximum depth.
      - A node has fewer than a minimum number of points to justify a further split.
      - A split does not significantly improve the homogeneity of a node (based on a minimum improvement threshold).
      - All the data points in a node belong to the same class (pure node).

8. **What is meant by tree pruning in the context of decision trees?**
   - Tree pruning is a technique used to reduce the complexity of the final model and thus help prevent overfitting.
   - Pruning aims to improve the model's generalization capabilities by removing parts of the tree that provide little to no value in predicting the target variable.
   - The process of pruning involves cutting back branches of the tree. There are two main types of pruning:
     1. **Pre-Pruning (Early Stopping)**: This approach involves stopping the tree from growing beyond a certain point during its initial construction. Criteria for pre-pruning may include stopping the tree growth when the tree reaches a maximum specified depth.
     2. **Post-Pruning**: In contrast to pre-pruning, post-pruning involves first growing a full tree and then removing branches that contribute little to the tree's ability to classify instances correctly.

9. **What are the three measures of impurity we discussed in relation to decision trees and how do they compare?**
   - The three measures of impurity discussed in relation to decision trees are:
     - Classification Error,
     - Entropy, and
     - Gini Impurity.
   - These measures are used to evaluate the quality of a split in the decision tree and to decide how to divide the data at each node, so as to achieve the most homogenous subgroups with respect to the target variable. See lecture notes for equations.
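The three impurity measures in Q9 can be computed in a few lines. The sketch below uses the standard formulas (Gini = 1 − Σ pᵢ², Entropy = −Σ pᵢ log₂ pᵢ, Classification Error = 1 − max pᵢ); the function name and the example class counts are made up for illustration, and exact equations are in the lecture notes.

```python
import numpy as np

def impurity_measures(class_counts):
    """Return (Gini, entropy, classification error) for one node, given class counts."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                   # class proportions in the node
    gini = 1.0 - np.sum(p ** 2)                       # Gini impurity
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy (log base 2)
    class_error = 1.0 - p.max()                       # classification error
    return gini, entropy, class_error

# A pure node vs. a perfectly mixed node
print(impurity_measures([50, 0]))    # (0.0, 0.0, 0.0)  -> perfectly homogenous
print(impurity_measures([25, 25]))   # (0.5, 1.0, 0.5)  -> maximally impure
```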
10. **What are random forests?**
    - Random Forests are ensembles of Decision Trees.
    - They function by constructing a multitude of decision trees at training time.
    - For classification tasks, the output of the Random Forest is the class selected by most trees. The method combines the predictions from multiple decision tree models to reduce the amount of overfitting.

11. **How are random forests built?**
    Here is a step-by-step overview of how Random Forests are constructed:
    1. **Bootstrap Sampling**: For each tree in the forest, a bootstrap sample is drawn from the original training dataset. This sample is created by randomly selecting observations with replacement, meaning the same observation can appear multiple times in the sample. This process ensures that each decision tree in the Random Forest is trained on a slightly different dataset.
    2. **Random Feature Selection**: When growing each tree, at each split, instead of searching for the best feature among all features, the Random Forest randomly selects a subset of the features. The size of the subset is typically a parameter set by the user, and the best split is found within this subset. This introduces more diversity among the trees and lowers the correlation between the trees in the forest, enhancing the ensemble's overall performance.
    3. **Building Decision Trees**: Each bootstrap sample is used to build a decision tree. Since the training dataset for each tree is different due to the bootstrap sampling, and only a subset of features is considered for splitting at each node, each tree in the forest ends up being different. These trees are grown to their maximum size without pruning; overfitting is, somewhat counter-intuitively, controlled by the ensemble method itself.
    4. **Aggregating Trees' Predictions**: After all the trees have been built, the Random Forest aggregates their predictions. For classification tasks, each tree "votes" for a class, and the class receiving the majority of votes becomes the model's prediction.
    5. **Output Prediction**: The aggregated predictions from all trees are used to make a final prediction. Since the Random Forest combines multiple models, it usually performs better than individual decision trees, especially on complex datasets that are prone to overfitting. The model's robustness comes from the diversity of the trees and the averaging process, which smooths the prediction, reducing variance without increasing bias significantly.

12. **What are the key hyperparameters we usually set in random forests?**
    - **Number of Trees (n_estimators)**: The number of trees in the forest. Generally, more trees increase model performance and robustness but also computational cost. There is a point of diminishing returns where adding more trees has a minimal effect on improving model performance.
    - **Maximum Depth of the Trees (max_depth)**: The maximum depth limits how deep the trees can grow.
    - **Bootstrap (bootstrap)**: Whether or not bootstrap samples are used when building trees. If not, the whole dataset is used to build each tree. Using bootstrap sampling usually improves model robustness by reducing variance.
    - **Criterion (criterion)**: The function used to measure the quality of a split. For classification, "gini" for Gini impurity and "entropy" for information gain are common choices. For regression, options like "squared_error" for mean squared error are used.

13. **Explain the K-nearest Neighbors (KNN) classifier algorithm.**
    1. Choose the number of neighbors k and a distance metric.
    2. Find the k nearest neighbors of the data example we need to classify.
    3. Assign the class label by majority vote.
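A brief scikit-learn sketch showing the hyperparameters listed in Q12 in use; the dataset and the specific values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_depth=None,       # grow each tree to its maximum size (no pruning)
    bootstrap=True,       # train each tree on a bootstrap sample
    criterion="gini",     # impurity measure used to choose splits
    random_state=1,
)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_.round(3))
```

The `feature_importances_` attribute printed at the end is one way to judge which features a fitted forest considers relevant, which also relates to the feature-selection question in Week 5 below.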
Week 5 Data Preprocessing
-------------------------

1. **Discuss two methods for dealing with missing data. What are their advantages and disadvantages?**

2. **What is categorical data? Which two types of categorical data are there and how are they different?**

3. **How do we encode nominal features for use in machine learning? Why not just allocate different integer numbers to nominal features?**

4. **Why do we often scale features before using them in a machine learning algorithm?**
   1. **Improving Gradient Descent Efficiency**: Many machine learning algorithms use gradient descent as an optimization technique to minimise the cost function. When features are on different scales, the cost function's shape can become elongated, causing the gradient descent algorithm to take longer to converge to the minimum. Scaling features to a similar range leads to faster convergence.
   2. **Improving Learning Algorithm Performance**: Some algorithms, particularly those that use regularisation (e.g., L1 and L2), assume that all features are centered around zero and have a variance of the same order. Without scaling, features with larger scales could be weighted more heavily than those with smaller scales, potentially leading to a model that incorrectly prioritises certain features over others.
   3. **Facilitating Feature Interpretation**: In models where the magnitude of feature coefficients indicates the importance of features (e.g., in linear models), scaling features can help in interpreting these coefficients in a more uniform and meaningful way.

5. **List two types of feature scaling we discussed in lecture notes, and discuss how they are different.**
   -
     - Limitations: It is sensitive to outliers. Since the scaling is based on the minimum and maximum values, extreme values can skew the scaling, compressing the majority of the data into a small portion of the scale.
   -
   -

6. **What is meant by overfitting?**

7. **What are possible solutions to overfitting?**
   - Collect more data and re-train the model on a larger dataset.
   - Dimensionality reduction:
   - Feature Selection:
     1.
     2.
   -
     1.
     2.

8. **List two types of regularisation we discussed in class and explain how they are different.**

9. **How are greedy algorithms suboptimal?**

10. **Explain Sequential Backward Selection.**
    1.
    2.
    3.
    4.
    5.
    6.

11. **Explain the Random Forest algorithm. How may we judge which features are relevant in the context of a fitted random forest model?**
    1.
    2.
    3.
    4.
    -
    -

Week 6 Dimensionality Reduction
-------------------------------

1. **What do we mean by dimensionality reduction in the context of ML models? Why is it needed?**
   -
   -
     1.
     2.
     3.

2. **List 3 methods of dimensionality reduction and briefly discuss them.**
   I.
   II.
   III.

3. **What is meant by feature extraction? How is it different from feature selection?**

4. **What is Principal Component Analysis (PCA), and what are principal components?**

5. **How are principal components extracted?**
   Principal Components are computed as follows:
   1. **Standardisation**: The first step often involves standardising the range of the initial variables so that each one of them contributes equally to the analysis.
   2. **Covariance Matrix Computation**: PCA computes the covariance matrix of the data to understand how the variables of the dataset vary from the mean with respect to each other.
   3. **Eigenvalue and Eigenvector Calculation**: The covariance matrix is then decomposed into its eigenvectors and eigenvalues. Eigenvectors and eigenvalues are mathematical constructs used to understand the direction and magnitude of variance in the data. Eigenvectors point in the directions of the largest variance of the data, while eigenvalues signify the magnitude of the variance in those directions.
   4. **Choosing Principal Components**: The eigenvectors are sorted by their corresponding eigenvalues in descending order. This ranking is crucial because the size of the eigenvalue indicates the amount of variance carried by its eigenvector. The principal components are selected based on the largest eigenvalues, i.e. the components that carry the most information (variance).

6. **What is Linear Discriminant Analysis (LDA)? How is LDA different from PCA?**

7. **What is Kernel Principal Component Analysis (KPCA)? How is it different from PCA?**
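The four PCA steps listed in Week 6 Q5 map almost line-for-line onto a NumPy sketch. The toy data, the choice of `k = 2` components, and the use of `np.linalg.eigh` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))              # toy data: 100 examples, 3 features

# 1. Standardisation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardised features
cov = np.cov(X_std.T)

# 3. Eigenvalue and eigenvector calculation (eigh: the covariance matrix is symmetric)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 4. Sort components by explained variance and keep the top k
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
explained_ratio = eig_vals / eig_vals.sum()

k = 2
W = eig_vecs[:, :k]             # projection matrix of the top-k principal components
X_pca = X_std @ W               # data projected onto the principal components

print("explained variance ratio:", explained_ratio.round(3))
print("transformed shape:", X_pca.shape)
```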
Week 7 Model Evaluation and Hyperparameter Tuning
-------------------------------------------------

1. **What are pipelines in scikit-learn and how are they useful?**
   1.
   2.
   3.
   4.

2. **What do fit and predict methods of pipelines do?**
   -
   -
   (See lecture notes for the pipeline diagram.)

3. **What is the Holdout Method in the context of training machine learning models?**
   1.
   2.
   3.

4. **What is K-fold cross validation?**
   1.
   2.
   3.
   4.

5. **Why is cross-validation needed?**
   -
   -

6. **How are hyperparameters tuned via grid-search?**
   -
   -
   -
   1.
   2.
   3.
   4.
   5.
   6.

7. **What are Learning Curves and what are they used for?**
   -
   -

8. **What are Validation Curves and what are they used for?**
   -
   -
   -
   -

9. **What is the confusion matrix?**
   -
   -
   -
   -
   -

10. **What are Precision, Recall and F1?**

Week 8 Ensemble Learning
------------------------

1. **What is Ensemble Forecasting, why is it used, and how does it work?**

2. **What is Majority and Plurality voting?**

3. **What is hard voting and how does it work?**

4. **What is soft voting and how does it work?**

5. **What is Bagging and how does it work?**
   (See lecture notes for the bagging diagram.)

6. **What are weak learners? Give an example of a weak learner.**
   -
   -

7. **What is Adaptive Boosting and how does it work in classification problems?**
   1.
   2.
   3.
   4.
   5.
   6.
   7.

8. **How is AdaBoost different from Bagging?**

Week 9 Cluster Analysis
-----------------------

1. **Discuss two distinctions between unsupervised learning and supervised learning.**
   1.
      -
      -
   2.
      -
      -

2. **How are Centroids different from Medoids (provide definitions)?**

3. **Explain the K-Means method.**
   1.
   2.
   3.
   4.
   5.

4. **What is K-Means++?**
   -
   -

5. **What is Euclidean (and squared Euclidean) distance? Why do we use it?**
   -
   -

6. **What is the Elbow method used for?**
   1.
   2.
   3.
   4.

7. **What are Silhouette Plots in the context of clustering algorithms?**

8. **What are the two methods for organising clusters as hierarchical trees?**
   1.
   2.

9. **Explain the agglomerative clustering algorithm with the complete linkage method.**
   1.
   2.
   3.
   4.
   5.
   6.

10. **What is a dendrogram?**

11. **What is the DBSCAN algorithm?**
    -
    -
    -
    1.
    2.
    3.
    4.
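Since the Week 9 answers are left blank above, here is a rough scikit-learn sketch tying together K-Means (Q3), K-Means++ initialisation (Q4), and the Elbow method (Q6). The toy blobs dataset, the range of k values, and the final choice of three clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

# Elbow method: fit K-Means for a range of k and record the within-cluster SSE
sse = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=1)
    km.fit(X)
    sse.append(km.inertia_)       # sum of squared distances to the closest centroid

for k, v in zip(range(1, 8), sse):
    print(f"k={k}: within-cluster SSE = {v:.1f}")   # look for the 'elbow' in this curve

# Final model with the chosen number of clusters
labels = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=1).fit_predict(X)
```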
Week 10 Regression Analysis P1
------------------------------

1. **Briefly explain and compare simple and multiple linear regressions.**
   -
   -
   -

2. **Explain what is meant by Exploratory Data Analysis, what it is used for, and provide four examples of it.**
   -
   -
   -
   1.
   2.
   3.
   4.

3. **What are scatter plots? What do we look for in them?**

4. **What are correlations? What is the correlation matrix, and how do we use it?**
   -
   -
   -
   1.
   2.
   3.
   4.

5. **What is RANSAC? Why is it needed? Briefly explain the RANSAC algorithm.**

6. **Provide and explain three methods of evaluating the performance of linear regression models.**
   -
   -
   -

7. **Why is regularisation used in regression models? How do we incorporate regularisation in regression models? Describe 3 linear regression models which implement regularisation (without equations).**

8. **What are polynomial regression models and when are they used? What are some potential disadvantages of polynomial regression models?**
   -
   -

9. **Why are decision trees incorporated in regression analysis? List and discuss some advantages and disadvantages of decision tree regression models.**
   -
   -
   -
   -
   -
   -

10. **Explain what random forest regression is, and very briefly how it works. List and briefly discuss its advantages and disadvantages.**
    -
    -
    -
    -
    -

11. **What is Stacking in regression analysis, and how does it work?**
    -
    -
    -
    -

12. **What is meant by the serialising of trained scikit-learn models? Why is it used?**
    -
    -
    -

Week 11 Regression Analysis P2
------------------------------

1. **What is the main difference between time series data and cross-sectional data?**

2. **What is the main goal of time series forecasting?**

3. **What is the defining characteristic of a white noise process?**

4. **What does the autocorrelation function (ACF) measure?**

5. **In an AR(*p*) model, what does the parameter *p* represent?**

6. **Explain the purpose of using the Partial Autocorrelation Function (PACF).**

7. **What kind of model should be used if the current value of a time series depends only on past error terms?**

8. **What is the relationship between the ACF and MA(q) models?**

9. **What does the ARMA(p, q) model combine?**

10. **What does it mean for AR(p) models if the PACF cuts off after two lags?**

11. **What is the key difference between AR and MA models?**

12. **What is the difference between Mean Squared Error (MSE) and Mean Absolute Error (MAE) in evaluating model performance?**
    - **Mean Squared Error (MSE)** calculates the average of the squared differences between the predicted and actual values. It penalises larger errors more heavily due to squaring, making it more sensitive to outliers.
    - **Mean Absolute Error (MAE)** calculates the average of the absolute differences between the predicted and actual values. It treats all errors equally, making it less sensitive to outliers compared to MSE.
    - In summary, **MSE** gives more weight to larger errors, while **MAE** gives equal weight to all errors.

Week 12 Sentiment Analysis
--------------------------

1. **Briefly explain what is meant by Sentiment Analysis in the context of machine learning. Provide one example of how sentiment analysis can be applied.**

2. **What is the Bag-of-Words model?**
   -
   -

3. **What is an N-gram model?**

4. **Briefly explain what Term Frequency-Inverse Document Frequency is used for (without formulas).**

5. **What is meant by a regular expression in Python?**

6. **What is a token in NLP?**

7. **What is Word Stemming? Provide an example of stemming.**

8. **What are Stop-Words? Provide several examples of stop-words.**

9. **What is the main difference between Count Vectorizer and Hashing Vectorizer?**

10. **What is meant by Topic Modeling?**

11. **What is out-of-core learning?**
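To ground the Week 12 terms (bag-of-words, n-grams, TF-IDF, tokens, stop-words) in code, here is a small scikit-learn sketch. The three example "reviews" are made up, and the unigram/bigram setting is just one possible configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up documents standing in for a sentiment dataset
docs = [
    "the movie was great",
    "the movie was terrible",
    "a great, great film",
]

# Bag-of-words: token counts per document (unigrams and bigrams, English stop-words removed)
count_vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
bow = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())   # the learned vocabulary (tokens and 2-grams)
print(bow.toarray())                       # raw term counts

# TF-IDF: down-weights terms that appear in many documents
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))
```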