
BUSA3020 Revision



Week 1: Introduction

Explanatory vs Predictive Analytics
Explanatory Models - models that are built to test causal hypotheses that specify how and why certain empirical phenomena occur. Explanatory statistical models are based on underlying causal relationships between theoretical constructs.
○ Causal theoretical model -> a set of hypotheses -> test using statistical models and statistical inference
Predictive Models - statistical models that can generate accurate predictions of new observations.
○ Integrate knowledge from existing theoretical models in a less formal way
○ Rely on associations between measurable variables
○ Statistical significance plays a smaller role

Supervised vs Unsupervised Learning
Supervised Learning: build a model from labelled training data that allows us to make predictions about unseen or future data (testing data). Examples include:
○ Classification: predict categorical class labels based on a set of features
○ Regression: predict the outcome of a continuous variable based on a set of features
Unsupervised Learning: discover structure in the data from an unlabelled dataset. Examples include:
○ Clustering: group a set of objects so that objects in the same group are more similar to each other than to those in other clusters, without any prior knowledge of their group memberships
○ Topic Modelling: a technique that assigns topics to unlabelled text documents

--------------------------------------------------------------------------------

Week 2 - Classification Algorithms 1

Types of Classification
Classification is the problem of predicting the categorical class labels of new instances based on a set of features.
Binary Classification: classification tasks with two classes, e.g. True/False
Multi-class Classification: classification tasks with more than two classes, e.g. Buy/Sell/Hold

The Perceptron Learning Algorithm
The Perceptron is a supervised machine learning algorithm used for binary classification. It operates as a single-layer linear classifier (a single-layer neural network), although it can be extended to multiple layers.
In the model z_i = w_0 + w_1*x_1i + w_2*x_2i:
○ w_0 is the bias unit (the intercept in regression analysis), while w_1 and w_2 are the weights (the coefficients in the regression context).
○ The unit step function is the function that predicts y (target variable) based on X (features) in the context of a classification algorithm.

Adaptive Linear Neuron - Adaline
Adaline is a modification of the perceptron algorithm. In Adaline, the weights are updated using the errors computed from the output of the linear activation function and the true class labels.

Hyperparameters
Parameters that are set by the analyst and not optimised from the data, e.g. the learning rate and the number of epochs.

Feature Scaling
Feature scaling is a method used to transform the range of independent variables or features, e.g. standardisation. It leads to quicker convergence of optimisation algorithms such as gradient descent. Feature standardisation makes the values of each feature have zero mean (by subtracting the mean in the numerator) and unit variance.
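A minimal sketch of feature standardisation, assuming a small illustrative NumPy array (the values are hypothetical, not course data). It shows the manual zero-mean/unit-variance transform and the equivalent scikit-learn transformer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on very different scales (illustrative only).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Manual standardisation: subtract the column mean, divide by the column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent scikit-learn transformer.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # True: both give zero mean, unit variance
```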
Python Libraries
Pandas: a Python library used for data manipulation and analysis. It has functions for analysing, cleaning, exploring, and manipulating data.
NumPy: a Python library used for working with arrays. It simplifies array operations in Python, making it useful for data analysis.
Scikit-learn: a versatile machine-learning library in Python that provides simple and efficient tools for data mining and data analysis. It features algorithms for classification, regression, clustering, dimensionality reduction, and model selection, along with utilities for data preprocessing and model evaluation.

Gradient Descent
Gradient descent is a method in machine learning that adjusts model parameters step by step to reduce a cost function. It works by finding the direction of the steepest decrease in the cost function and adjusting the parameters accordingly until reaching the best value.
Stochastic Gradient Descent (SGD) (iterative or online gradient descent): updates the weights incrementally for each training example.
Mini-Batch Gradient Descent: the weights are updated using subsets of the training data to compute the gradient.

--------------------------------------------------------------------------------

Week 3 - Classification Algorithms 2

Overfitting and Underfitting
- Overfitting happens when the model captures "patterns" in the training data that do not repeat in new data. Essentially, the model fails to generalise to unseen data.
- Underfitting happens when a model cannot capture the underlying trend of the data, usually because it is too simple, resulting in poor performance on both training and new data.

Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental principle that describes the tradeoff between the model's complexity (variance) and its accuracy in capturing underlying trends (bias); ideally, one seeks an appropriate balance between bias and variance to achieve predictions that are as accurate as possible.

Regularisation
Regularisation is a technique used to prevent overfitting by adding a penalty on large magnitudes of the model parameters. It is typically done by adding a regularisation term to the loss function, such as L1 or L2 regularisation.
○ L1 regularisation adds a penalty equal to the absolute value of the magnitude of the coefficients:
Cost Function = Loss(y, ŷ) + λ Σ |w_i|
○ L2 regularisation adds a penalty equal to the square of the magnitude of the coefficients:
Cost Function = Loss(y, ŷ) + λ Σ w_i²
The primary reason for regularisation is to introduce and control the trade-off between bias and variance, leading to more generalised models that perform better on unseen data.

Accuracy Formula
Misclassification error is the fraction of observations incorrectly classified. To calculate it, sum all misclassified examples and divide by the number of observations. Accuracy is 1 minus the misclassification error.

Logistic Regression
Logistic regression is a classification algorithm commonly used for binary classification tasks. It models the probability that a given input belongs to a particular class.
- Conditional probability and odds
- Log-odds = logit function of the odds (see lecture notes)
- The sigmoid (logistic) function is the inverse of the logit function

Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) - estimating the parameters of a probability distribution by maximising a likelihood function so that the observed data is most probable. The weights are optimised using MLE.

Likelihood Function
The likelihood function is the joint probability distribution of the sample, under the assumption that the errors are independent.

Cost Function
The cost function is formed by taking the negative of the log likelihood.
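A minimal sketch of L2-regularised logistic regression with scikit-learn, assuming the Iris data restricted to two classes purely to make the task binary; the train/test split and C value are illustrative choices (in scikit-learn, C is the inverse of the regularisation strength λ).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]          # keep two classes -> binary classification

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# penalty='l2' adds the lambda * sum(w_i^2) term to the negative log likelihood.
clf = LogisticRegression(penalty='l2', C=1.0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))      # accuracy = 1 - misclassification error
```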
--------------------------------------------------------------------------------

Week 4 - Classification Algorithms 3

Support Vector Machine (SVM)
SVM aims to maximise the margin between the decision boundary that separates the classes and the closest data points from each class (the support vectors), and thus obtain optimal weights. It is less sensitive to outliers than other classification algorithms because it mostly cares about the points closest to the decision boundary.

Maximum Margin Classification
Margin: the normalised distance between the positive and negative hyperplanes. Models with large margins are less likely to be overfitted.

Linearly Nonseparable Cases
Slack Variables: slack variables are used in soft-margin classification. They allow the algorithm to converge (we can find optimal w) even when dealing with data that is not linearly separable.
Kernel Methods: deal with linearly inseparable data. A kernel computes a nonlinear function of the original features to provide additional dimensions in the feature space. This allows us to separate the two classes with a linear hyperplane, which becomes a nonlinear decision boundary when we project it back to the original feature space.

Decision Tree Learning
Decision trees are a type of supervised learning algorithm that models decisions and their possible consequences as a tree-like structure. Deep trees tend to be overfitted.

Maximising Information Gain
Information Gain (IG) is the difference between the impurity of the parent node and the sum of the child node impurities.
The node impurity is a measure of the homogeneity (sameness) of the labels at the node.
The lower the impurities of the child nodes, the larger the information gain.

Measures of Impurity
Entropy quantifies the amount of uncertainty or disorder in a system; it is often used in decision trees for measuring impurity.
Classification error measures the proportion of misclassified instances in a dataset.
Gini impurity calculates the probability of incorrect classification when randomly assigning a label to a randomly chosen sample.
These measures are used to evaluate the quality of a split in the decision tree and to decide how to divide the data at each node to achieve the most homogeneous subgroups with respect to the target variable.
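A minimal sketch of the three impurity measures for a binary node, where p is the proportion of class-1 examples at the node; the p values used below are purely illustrative.

```python
import numpy as np

def entropy(p):
    # -sum over classes of p*log2(p), for a two-class node
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

def classification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, entropy(p), gini(p), classification_error(p))
# All three peak at p = 0.5 (maximum disorder) and fall towards 0 as the node
# becomes pure, which is why lower child impurity means larger information gain.
```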
How are Decision Trees Built?
1. Choose the best feature to split on
2. Split the data
3. Recursively repeat for each child node
4. Stopping criteria

Tree Pruning
Tree pruning is a technique used to reduce the complexity of the final decision tree model and thus help prevent overfitting. There are two types:
Pre-Pruning (Early Stopping): stopping the tree from growing beyond a certain point during its initial construction.
Post-Pruning: in contrast to pre-pruning, post-pruning involves first growing a full tree and then removing branches that contribute little to the tree's ability to classify instances correctly.

Random Forest
Random forests are ensembles of decision trees. They work by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. The method combines the predictions from multiple decision tree models to reduce the amount of overfitting.

Random Forest Algorithm
1. Bootstrap Sampling: for each tree in the forest, a bootstrap sample is drawn from the original training dataset. This sample is created by randomly selecting observations with replacement, meaning the same observation can appear multiple times in the sample. This ensures that each decision tree in the random forest is trained on a slightly different dataset.
2. Random Feature Selection: when growing each tree, at each split, instead of searching for the best feature among all features, the random forest randomly selects a subset of the features. The size of the subset is typically a parameter set by the user, and the best split is found within this subset. This introduces more diversity among the trees and lowers the correlation between the trees in the forest, enhancing the ensemble's overall performance.
3. Building Decision Trees: each bootstrap sample is used to build a decision tree. Since the training dataset for each tree is different due to the bootstrap sampling, and only a subset of features is considered for splitting at each node, each tree in the forest ends up being different. The trees are grown to their maximum size without pruning; overfitting is instead controlled by the ensemble averaging itself.
4. Aggregating Trees' Predictions: after all the trees have been built, the random forest aggregates their predictions. For classification tasks, each tree "votes" for a class, and the class receiving the majority of votes becomes the model's prediction.
5. Output Prediction: the aggregated predictions from all trees are used to make the final prediction.

Hyperparameters
Number of Trees (n_estimators): the number of trees in the forest. Generally, more trees increase model performance and robustness but also computational cost. There is a point of diminishing returns where adding more trees has a minimal effect on performance.
Maximum Depth of the Trees (max_depth): limits how deep the trees can grow.
Bootstrap (bootstrap): whether or not bootstrap samples are used when building trees. If not, the whole dataset is used to build each tree. Using bootstrap sampling usually improves model robustness by reducing variance.
Criterion (criterion): the function used to measure the quality of a split. For classification, "gini" (Gini impurity) and "entropy" (information gain) are common choices. For regression, options like "squared_error" (mean squared error) are used.

K-Nearest Neighbours (KNN)
The KNN algorithm finds the k examples in the training set that are closest to the point we are trying to classify, according to a chosen distance metric. KNN is a lazy learner: it does not learn a decision boundary from some function of the data but instead memorises the training dataset.

KNN Algorithm
1. Choose the number of neighbours k and a distance metric
2. Find the k nearest neighbours of the data example we need to classify
3. Assign the class label by majority vote

KNN Hyperparameters
k - crucial in finding a good balance between overfitting and underfitting
Distance metric - different metrics will find different neighbours
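A minimal sketch fitting both classifiers with the hyperparameters named above; the dataset and the specific values are illustrative only, not prescribed settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(n_estimators=100, max_depth=None,
                                criterion='gini', bootstrap=True, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')  # k and the distance metric

for model in (forest, knn):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```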
--------------------------------------------------------------------------------

Week 5 - Data Preprocessing

Dealing with Missing Data
The two main methods for dealing with missing data are:
Deleting rows with missing values: deletion is straightforward and maintains the purity of the data (only rows with no missing values are kept), but it risks significant information loss and potential bias if the missingness is not random.
Imputing missing values: imputation preserves data integrity by filling in missing values, which can prevent sample-size reduction and mitigate bias. However, it requires assumptions about the missingness mechanism and can add complexity to the data preprocessing phase.
- For numeric data: mean/median
- For categorical data: mode

Categorical Data: Nominal vs Ordinal
Ordinal Features: categorical values that can be ordered or sorted, e.g. shirt size: XL > L > M > S
Nominal Features: no ordering possible, e.g. colour: {green, blue, red}

Encoding Categorical Variables
Nominal categorical variables: one-hot encoding with dummy variables/features
Ordinal categorical variables: mapping into integers (for size: XL = 4, L = 3, M = 2, S = 1)
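A minimal sketch of both encoding strategies with pandas; the toy DataFrame follows the colour/size examples above, and the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'colour': ['green', 'blue', 'red'],
                   'size': ['M', 'L', 'XL']})

# Ordinal feature: map the ordered categories to integers.
size_mapping = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
df['size'] = df['size'].map(size_mapping)

# Nominal feature: one-hot encode into dummy variables.
df = pd.get_dummies(df, columns=['colour'])
print(df)
```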
Feature Scaling

Normalisation
Normalisation (min-max scaling): scales features to a specified range (usually [0, 1]). However, it is sensitive to outliers: since the scaling is based on the minimum and maximum values, extreme values can skew the scaling, compressing the majority of the data into a small portion of the scale.
The transformation is done using the formula:
X_scaled = (X − X_min) / (X_max − X_min)
where X_min and X_max are the minimum and maximum values of the feature, respectively.

Standardisation
Standardisation: scales features so they have a mean of 0 and a standard deviation of 1. However, since it does not scale the data to a fixed range, it might not be suitable for algorithms that require input variables to be within a bounded range.
The transformation formula is:
X_scaled = (X − µ) / σ
where µ is the mean and σ is the standard deviation of the feature.

Feature Selection

L1 vs L2 Regularisation
The key characteristic of L1 regularisation is its ability to produce sparse models, that is, models with only a subset of the coefficients being non-zero.
L2 regularisation does not lead to sparse models (i.e. models where some coefficients are exactly zero). Instead, it tends to shrink the coefficients evenly but does not necessarily bring them exactly to zero.

Key Differences between L1 and L2
Sparsity: L1 regularisation can zero out coefficients, leading to sparse models, which is beneficial for feature selection. L2 regularisation, on the other hand, does not produce sparse models but rather shrinks the coefficients towards zero.
Number of Solutions: L1 regularisation can lead to multiple solutions, while L2 regularisation leads to one solution.
Robustness: L1 regularisation is more robust to outliers since it penalises the absolute value of the coefficients.
Computational Difficulty: L2 is easier to optimise.

Sequential Feature Selection
Greedy algorithms are a class of algorithms that make the locally optimal choice at each step, with the hope of finding a global optimum. Their main limitation is that they do not always guarantee the globally optimal solution; this arises from their sequential approach to decision-making, which can lead them to overlook better solutions that require initially less optimal choices.

Sequential Backward Selection (SBS)
Sequential Backward Selection (SBS) is a feature selection technique used in machine learning to reduce the dimensionality of the data by removing irrelevant or redundant features. It aims to improve the performance of a model by decreasing computational complexity and helping to mitigate the risk of overfitting.
1. Start with the full feature set
2. Evaluate the performance
3. Remove a feature
4. Determine the feature to remove
5. Permanently remove the feature
6. Repeat steps 2 through 5 until the desired number of features is reached or until removing more features does not improve the performance of the model

Feature Importance with Random Forests
Using a random forest, we can measure feature importance as the averaged impurity decrease computed from all decision trees in the forest (without making any assumptions about whether the data is linearly separable or not).

--------------------------------------------------------------------------------

Week 6 - Dimensionality Reduction via Data Compression

High Dimensionality
High dimensionality = a large number of features. Precise fitting of the weights requires large amounts of data and comes with computational costs.

Dimensionality Reduction
Dimensionality reduction refers to the transformation of the features in the dataset from a high-dimensional space (many features) to a low-dimensional space (fewer features than the original number) while attempting to retain meaningful properties of the original data. It is needed for at least three reasons:
a. Reducing the extent of overfitting
b. The amount of data needed to obtain reliable results with high-dimensional features grows exponentially with the dimensionality and is often not available
c. Analysing high-dimensional data (hundreds or thousands of features) is often computationally intractable

Methods of Dimensionality Reduction
I. Regularisation - imposing a penalty (e.g. L1 and L2 penalties) on large parameter values with the aim of shrinking small parameters to zero and hence performing feature selection in this way.
○ Finding a balance between overfitting (small prediction errors captured by the cost function value) and underfitting (small parameter values encouraged by the penalty).
II. Sequential Feature Selection - selecting a small number of relevant features from a larger set via some kind of selection algorithm, e.g. the backward selection algorithm.
III. Feature Extraction - summarising the information content of the dataset by transforming the feature space into a smaller-dimensional space, e.g. via PCA or LDA (explained below).

Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an unsupervised linear transformation technique used for dimensionality reduction. It finds uncorrelated features that explain most of the variance in high-dimensional data. For linearly inseparable data, use KPCA (Kernel Principal Component Analysis).

PCA Algorithm
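A minimal sketch of PCA for feature extraction; the wine dataset and the choice of two components are illustrative, and the features are standardised first because PCA is sensitive to feature scales.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # standardise before PCA

pca = PCA(n_components=2)                   # keep the two largest-variance directions
X_pca = pca.fit_transform(X_std)

print(X.shape, '->', X_pca.shape)
print(pca.explained_variance_ratio_)        # share of variance explained per component
```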
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. LDA is a supervised algorithm that uses known class labels to make the data as separable as possible: it finds features that optimise class separability by taking the target variable into account.

LDA Algorithm

Kernel Principal Component Analysis (KPCA)
Kernel Principal Component Analysis (KPCA) is an extension of PCA that uses kernel methods to allow for nonlinear dimensionality reduction. KPCA transforms data that is not linearly separable onto a new, lower-dimensional subspace that is linearly separable.

--------------------------------------------------------------------------------

Week 7 - Model Evaluation and Hyperparameter Tuning

Pipelines
Pipelines in scikit-learn are tools for combining multiple processing steps into a single scikit-learn estimator. A pipeline bundles together a sequence of data transforms along with a final estimator. The primary advantages and uses of pipelines in scikit-learn include:
1. Simplicity and convenience
2. Reproducibility
3. Code maintenance
4. Parameter tuning

Tuning Hyperparameters

Holdout Cross-Validation
In machine learning, the holdout method extends to a three-way split. This approach refines the model evaluation and selection process by dividing the dataset into:
1. Training set: used for training the model, adjusting the model's parameters based on the input data and the corresponding labels or outcomes.
2. Validation set: used for tuning the hyperparameters of the model and deciding which model or models perform best, without using the test set. This set acts as a proxy for the test set during the model development phase.
3. Test set: used to evaluate the final model's performance. This set is held out and used only after the model has been finalised and the hyperparameters selected, to assess its generalisation ability on unseen data.
The holdout method is sensitive to how the data is partitioned.

K-Fold Cross-Validation
K-fold cross-validation is a widely used method for evaluating the performance of machine learning models, aiming to provide a more reliable assessment than the simple holdout method. It uses k−1 folds for model training and 1 fold for validation.
Stratified cross-validation is used when the class proportions are unequal.

How it Works
1. Divide the Dataset: the entire dataset is divided into k equally (or nearly equally) sized folds or subsets.
2. Perform Sequential Training and Validation: the process is repeated k times, each time using a different fold as the validation set and the remaining k−1 folds combined as the training set. This way, each fold is used as the validation set exactly once.
3. Training and Validation: for each iteration, the model is trained on the training set and evaluated on the validation set. The performance measure, such as accuracy for classification tasks or mean squared error for regression tasks, is recorded for each iteration.
4. Average the Performance: after all k iterations are completed, the performance scores are averaged to provide a single performance metric. This metric is considered a more robust estimate of the model's ability to generalise to unseen data than a single train-test split.
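A minimal sketch that combines the two ideas above: a pipeline (scaling followed by a final estimator) evaluated with stratified k-fold cross-validation. The dataset, the pipeline steps, and k = 10 are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# One estimator bundling the transform and the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=10)            # stratified: keeps class proportions per fold
scores = cross_val_score(pipe, X, y, cv=cv)  # trains and validates k times
print(scores.mean(), scores.std())           # averaged performance across the 10 folds
```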
Hyperparameter Tuning via Grid Search
Hyperparameter tuning via grid search is a systematic approach to optimising the hyperparameters of a model. Grid search automates the process of finding the best combination of hyperparameters by exhaustively searching through a specified subset of hyperparameters.

How it Works
1. Define the Parameter Grid: start by defining a grid of hyperparameter values you want to test. This grid is essentially a dictionary that maps hyperparameter names to lists of values for those parameters.
2. Set Up the Model: choose the model whose hyperparameters you want to tune.
3. Configure Grid Search: use a grid search tool (like GridSearchCV from scikit-learn) and pass it the model, the parameter grid, and other optional settings such as the cross-validation strategy.
4. Fit the Grid Search: run the grid search on your data. The grid search method systematically trains and evaluates the model using cross-validation for each possible combination of hyperparameters in your grid.
5. Evaluate Results: once the grid search has completed, review the results, comparing the performance of each hyperparameter combination. The grid search process identifies the combination that yields the best performance metric (e.g. accuracy for classification tasks).
6. Best Model and Parameters: use the best combination of hyperparameters to train your final model. The grid search object provides tools to easily retrieve the best parameter set and the model trained with it.
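A minimal sketch of the steps above using GridSearchCV; the SVC model inside a pipeline and the particular parameter values are illustrative, not prescribed by the course.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1.0, 10.0],         # step 1: the parameter grid
              'svc__kernel': ['linear', 'rbf']}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')  # steps 2-3
gs.fit(X, y)                                      # step 4: fit every combination with CV

print(gs.best_score_, gs.best_params_)            # steps 5-6: best score and parameters
```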
Learning Curves
Learning curves are a diagnostic tool used to understand how model performance depends on the sample size, and to identify issues such as overfitting or underfitting.

Validation Curves
Validation curves are a diagnostic tool used to understand how the performance of a model changes as the value of one of its hyperparameters is varied.
- They are useful in identifying overfitting and underfitting controlled by the strength of the regularisation parameter.
- They provide insight into how adjustments to hyperparameters affect the model's ability to generalise from the training data to unseen data.

Confusion Matrix
A performance evaluation tool used in classification tasks. The confusion matrix contains four components:
True Positives (TP): the number of positive instances correctly classified as positive.
True Negatives (TN): the number of negative instances correctly classified as negative.
False Positives (FP): the number of negative instances incorrectly classified as positive (also known as Type I error).
False Negatives (FN): the number of positive instances incorrectly classified as negative (also known as Type II error).

Precision, Recall and F1
Precision - the fraction of instances predicted as positive that are actually positive.
Precision = TP / (TP + FP)
Recall - measures the ability of the model to identify all relevant instances. It is the fraction of actual positive instances that the model correctly identifies.
Recall = TP / (TP + FN)
F1 score - a combination of precision (PRE) and recall (REC). The F1 score is especially useful when you need to balance precision and recall, which is often the case in datasets where both false positives and false negatives carry a significant cost, or when the class distribution is imbalanced.
F1 = 2 · Precision · Recall / (Precision + Recall)

--------------------------------------------------------------------------------

Week 8 - Combining Different Machine Learning Models

Ensemble Methods
An ensemble method involves generating multiple models (or learners) and combining their predictions. Ensembles are designed to improve the overall performance of predictive models by reducing bias and/or variance, although in different ways depending on the type of ensemble method used.
- Predictions are made via the voting of base models: majority vs plurality voting.
- Can improve bias and/or variance depending on the method used.

Majority vs Plurality Voting
Majority Voting occurs when the final prediction is the class that receives more than half of the votes from the ensemble of classifiers. For a decision to be made by majority voting in a binary classification problem, more than 50% of the classifiers must agree on the same class.
Plurality Voting is where the class with the highest number of votes from the models gets chosen, even if it does not surpass half of the total votes. This method is particularly useful for multi-class classification problems, where there are more than two possible outcomes and getting a majority for one class is less likely.

Hard Voting vs Soft Voting
Hard voting is a technique used in predictive ensembles for classification problems. It relies on the most frequent (after taking weights into account) class label predicted by the ensemble members.
Soft voting combines the predictions from multiple classifiers into a single, more accurate prediction, especially in classification problems. It takes into account the probability estimates for each class label provided by each classifier and averages these probabilities to determine the final output prediction.

Bagging
"Bagging", short for Bootstrap Aggregating, is an ensemble learning method that works by reducing variance (preventing overfitting) without significantly increasing bias.
- Create bootstrap samples
- Base models are of the same type: typically unpruned decision trees
- Reduces overfitting without significantly increasing bias

Adaptive Boosting (AdaBoost)
Adaptive Boosting, or AdaBoost, is an ensemble learning technique that works by combining multiple weak learners into a single strong learner.
Reduction of Bias and Variance: although each weak learner has high bias and low variance, their combination tends to reduce both bias and variance.
Sensitivity to Noisy Data: because AdaBoost focuses on instances that are hard to classify, it can be sensitive to noise and outliers, as these can receive disproportionately high weights.

How it Works
1) Each instance (observation) in the training set is initially assigned an equal weight.
2) Training a Weak Learner: a weak learner is trained on the weighted instances. It attempts to minimise the weighted error based on the current distribution of instance weights.
3) Calculating the Weighted Error Rate: after training, the algorithm calculates the error rate of the weak learner, which is the sum of the weights associated with the misclassified instances divided by the total sum of weights.
4) Updating the Weak Learner's Weight: the weight of the weak learner is calculated using its error rate. This weight reflects the learner's contribution to the final decision; a lower error rate results in a higher weight.
5) Updating Instance Weights: the weights of the instances are then updated to give more emphasis to the misclassified instances for the next weak learner.
6) Normalisation: after updating the weights, they are normalised so that the sum of the weights is 1, ensuring they form a proper distribution over the instances.
7) Repeating the Process: this process of training a weak learner, calculating its error, updating its weight, and then updating the instance weights is repeated for a predefined number of iterations or until perfect predictions are achieved.
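A minimal sketch of AdaBoost with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-one decision tree (a stump); the dataset and hyperparameter values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 200 sequentially weighted weak learners; learning_rate scales each learner's weight.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.1, random_state=1)
ada.fit(X_train, y_train)

print(ada.score(X_test, y_test))
```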
--------------------------------------------------------------------------------

Week 9 - Clustering Analysis

Clustering
Clustering is a technique that organises data into clusters or groups based on similarity, without prior knowledge of group assignments - a type of unsupervised learning.

Distinctions between Supervised and Unsupervised Learning
1. Labelling of Data:
○ Supervised learning requires a dataset that includes feature-target variable pairs. The model learns the relationship between the target variable and the features.
○ Unsupervised learning works with datasets that do not have labelled outputs.
2. Objective:
○ In supervised learning, the primary objective is to predict or determine an outcome (label) for new, unseen data based on the learned relationships between the features and the target. It is used for classification and regression tasks, where the relationship between the input variables and the target variable is modelled.
○ Unsupervised learning aims to model the underlying structure or distribution of the data in order to learn more about it. The goal is to explore the structure of the data to find patterns, clusters, or associations without knowing the true labels.

K-Means: Clustering Technique
K-Means is a widely used clustering algorithm which partitions n items into k clusters, where each item belongs to the cluster with the nearest mean. The algorithm operates iteratively to assign each data point to one of k groups based on the features provided.

K-Means Algorithm

Choosing the Optimal k
The Elbow Method: a heuristic for determining the number of clusters k using inertia.
Silhouette Plots: a graphical tool to plot a measure of how tightly grouped the examples in the clusters are.

K-Means++: Clustering Technique
K-Means++ is an algorithm for choosing the initial values (centroids) for the K-Means clustering algorithm.

K-Means++ vs K-Means
The goal of K-Means++ is to spread out the initial centroids more than the random initialisation used by K-Means, which can lead to suboptimal results.
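A minimal sketch of K-Means with k-means++ initialisation and the elbow heuristic via inertia (within-cluster sum of squared distances); the synthetic blob data and the range of k values are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Elbow method: inertia drops sharply up to the "elbow" at a suitable k.
for k in range(1, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=1).fit(X)
    print(k, km.inertia_)

# Final clustering with the chosen k.
labels = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=1).fit_predict(X)
```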
Hierarchical Trees: Clustering Technique
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. There are two main approaches to organising clusters as hierarchical trees:
Agglomerative (Bottom-Up) Method: start by assuming that each example is a single cluster, then merge the closest pairs of clusters iteratively until only one cluster remains.
Divisive (Top-Down) Method: start with one cluster and split it into smaller clusters iteratively until each cluster contains only one example.

Agglomerative Clustering: Measuring Distance
1. Single Linkage Approach: compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest.
2. Complete Linkage Approach: compute the distance between the most dissimilar members for each pair of clusters and merge the two clusters for which the distance between the most dissimilar members is the smallest.

Visualising Hierarchical Clusters
A dendrogram is a tree-like diagram used to illustrate the arrangement of the elements or clusters formed during hierarchical clustering analysis. It visually displays the sequence of cluster mergings and the distance at which each merging occurred. Dendrograms are particularly useful in agglomerative (bottom-up) hierarchical clustering, although they can represent divisive (top-down) clustering processes as well.

Agglomerative Complete Linkage Algorithm
1) Compute the distance matrix containing the distances between all data points.
2) Represent each data point as a cluster.
3) Merge the two closest clusters based on the distance between their most dissimilar members.
4) Update the similarity matrix (containing the distance metrics).
5) Repeat steps 3-4 until only one cluster remains.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The DBSCAN algorithm is a clustering method that identifies clusters in datasets based on the density of data points. It is based on the idea that a cluster in a dataset is a high-density area surrounded by a low-density area.
Density: the number of points within a specified radius.

Key Concepts of DBSCAN
Core Points: a point is considered a core point if it has a minimum number of points (MinPts) within a given radius (ε, epsilon). These points are essentially at the centre of a cluster.
Border Points: a border point is not a core point but falls within the radius of a core point.
Noise Points: any point that is neither a core point nor a border point is considered noise or an outlier, not belonging to any cluster.

DBSCAN Algorithm
1. Specify the radius ε.
2. Set the minimum number of neighbouring points, MinPts.
3. Label each point as either a core, border or noise point.
4. Form a separate cluster for each core point or connected group of core points (core points are connected if they are no further away than ε).
5. Assign each border point to the cluster of its corresponding core point.
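A minimal sketch of DBSCAN on synthetic two-moons data; eps (the radius ε) and min_samples (MinPts) are illustrative values chosen for this toy dataset.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=1)

db = DBSCAN(eps=0.2, min_samples=5)   # radius and minimum neighbours per core point
labels = db.fit_predict(X)

print(np.unique(labels))              # cluster ids; -1, if present, marks noise points
```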
--------------------------------------------------------------------------------

Week 10-11: Regression Analysis

Exploratory Data Analysis (EDA)
Exploratory Data Analysis, or EDA, is a preliminary step before more formal statistical techniques/analytics are applied and can be crucial for understanding the underlying structure of a dataset. EDA is used for:
- Finding anomalies/outliers and missing data
- Gaining insights into the variables in a dataset
- Uncovering relationships between variables
Examples:
1. Tables of summary statistics
2. Histograms/distribution plots (numeric) and bar/pie charts (categorical)
3. Scatter plots
4. Correlation matrices

Simple vs Multiple Linear Regression

Simple Regression
Simple regression models a linear relationship between a target/dependent variable and one independent variable/feature, with the aim of predicting the values of the dependent variable based on the feature.
Y = β0 + β1·X + ε
- Y is the target variable
- X is the predictor variable/feature
- β0 and β1 are the model weights/coefficients
- ε is the error term

Multiple Regression
Multiple regression models a linear relationship between a target/dependent variable and more than one independent variable/feature to predict the values of the dependent variable based on the values of the features.
Y = β0 + β1·X1 + β2·X2 + ... + βn·Xn + ε
where the predictor variables are denoted by X1, ..., Xn and their associated coefficients by β1, ..., βn.

Comparison
- Complexity: multiple regression is more complex, using several independent variables, unlike simple regression which uses only one.
- Interpretation: the interpretation of a simple linear regression model is more straightforward than that of a multiple linear regression model because you only have to consider the relationship between two variables. In multiple regression, you must understand how all the variables work in conjunction (including interactions and multicollinearity).
- Applications: simple linear regression is used when there are no other potential confounding variables. Multiple regression is used when there are one or more influencing factors.

Ordinary Least Squares - Linear Regression Model
Ordinary Least Squares (OLS) estimates the parameters of the linear regression line by minimising the sum of squared vertical distances (prediction errors/residuals) from the estimated line to the training examples.
We can optimise the regression weights (betas) using gradient descent (GD) or stochastic gradient descent (SGD), as in Adaline:
○ Gradient descent: a method in machine learning that adjusts model parameters step by step to reduce a cost function.
○ Stochastic Gradient Descent (SGD): updates the weights incrementally for each training example.

Outliers in Regression Analysis
RANSAC (Random Sample Consensus) is an iterative method used in data fitting that is particularly robust to outliers. In the context of linear regression, RANSAC is used to estimate the parameters of a regression model while being resilient to anomalies/outliers in the dataset. Unlike standard regression methods, which can be heavily influenced by outliers, RANSAC focuses on fitting the model to only the best subset of the data, deemed "inliers".

RANSAC Algorithm
1) Select a minimum number of randomly chosen examples to be treated as inliers and fit the model.
2) Test all other data points against the fitted model and add those points that are within a user-given distance from the fitted line.
3) Refit the model using all inliers.
4) Terminate the algorithm if the fraction of inliers over the sample size exceeds a predefined threshold or if a fixed number of iterations is reached. Otherwise, go to step 1.

Inliers and Outliers
Inlier: an observation or data point that falls within the expected range or pattern of a dataset, contrasting with outliers that deviate significantly from the norm.
Outlier: an observation or data point that significantly deviates from the rest of the dataset, often indicating a rare occurrence, a measurement error, or important information deserving further investigation.
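A minimal sketch of RANSAC regression using scikit-learn's RANSACRegressor (which wraps an ordinary linear regression by default); the synthetic data, injected outliers, and residual_threshold value are all illustrative.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)
y[:5] += 40                                  # inject a few artificial outliers

# residual_threshold is the user-given distance for accepting points as inliers.
ransac = RANSACRegressor(residual_threshold=5.0, random_state=1)
ransac.fit(X, y)

print(ransac.estimator_.coef_, ransac.estimator_.intercept_)  # fitted on inliers only
print(ransac.inlier_mask_.sum(), 'inliers out of', len(y))
```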
Evaluating Regression Performance

Mean Squared Error (MSE)
Mean Squared Error (MSE) is the average of the squares of the errors, that is, the average squared difference between the predicted values and the actual values. It provides a simple measure of the model's prediction accuracy by indicating how close the predictions are to the actual outcomes. A lower MSE suggests a better fit of the model to the data. MSE is heavily influenced by outliers because the error terms are squared.
MSE = (1/n) Σ (y_i − ŷ_i)²
➔ y_i are the true values
➔ ŷ_i are the predicted values
➔ n is the number of observations

R-Squared (R²)
R-squared, also known as the coefficient of determination, is a statistical measure of how close the data are to the fitted regression line. R² provides an insight into how well the true values are replicated by the model, based on the proportion of the total variation of outcomes explained by the model. An R² of 1 indicates a perfect fit.
R² = 1 − σ²_ε / σ²_y
R² is always between 0 and 1: 0 indicates that the model explains none of the variability of the response data around its mean, and 1 indicates that it explains all the variability. A higher R² indicates a better fit and more variability explained by the model.

Residual Plot
A residual plot is a graph that shows the residuals on the vertical axis and the predicted values, or an independent variable, on the horizontal axis.
Purpose: to detect non-linearity, to identify outliers, and to check whether the errors are randomly distributed.
Interpretation:
- Random distribution: ideally, the residuals should be randomly scattered around the horizontal axis (zero). If the residuals form a pattern (e.g. a curved line, clustering), the model might be misspecified for the data.
- No outliers: no single residual should be too far from zero compared to the others.

Mean Absolute Percentage Error (MAPE)
Mean Absolute Percentage Error (MAPE) computes the average percentage deviation between predicted and actual values, offering insight into the model's predictive accuracy. A lower MAPE indicates greater predictive performance, i.e. closer alignment between the model's forecasts and the actual values (e.g. actual used-vehicle prices in the course example).
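A minimal sketch computing the three metrics above on illustrative true and predicted values; MSE and R² use scikit-learn, while MAPE is computed directly from its definition.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mse = mean_squared_error(y_true, y_pred)                    # average squared error
r2 = r2_score(y_true, y_pred)                               # proportion of variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100    # average percentage error

print(mse, r2, mape)
```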
Regularisation Methods for Regression Analysis
Regularisation introduces a penalty against complexity by shrinking parameter values.

L2 Regularisation - Ridge Regression
Ridge regression adds a penalty equal to the square of the magnitude of the coefficients to the loss function, reducing the size of the coefficients but keeping all variables in the model. This method is effective when variables are highly correlated (multicollinearity).

L1 Regularisation - LASSO (Least Absolute Shrinkage and Selection Operator)
Lasso regression introduces a penalty that is the absolute value of the magnitude of the coefficients. This can shrink some coefficients to zero, effectively performing feature selection and producing a model that includes only the most significant variables. Lasso simplifies models by completely removing the weights of less significant features.

L1 + L2 Regularisation - Elastic Net
Elastic Net combines the penalties from both Ridge and Lasso, integrating the benefits of both. It is useful when there are numerous highly correlated variables and helps to maintain a balance between feature selection and coefficient shrinkage.

Non-Linear Regression Models

Polynomial Regression
Polynomial regression is a type of regression analysis that models complex relationships that do not follow simple linear patterns. This method fits a curve between the target and the features, making it valuable for forecasting in scenarios like marketing analysis and financial modelling, where understanding nonlinear relationships between variables can lead to significantly better decision-making outcomes.
Disadvantages of Polynomial Regression: overfitting risk; computational complexity.

Decision Tree Regression
Decision trees are used to predict a continuous outcome by learning decision rules inferred from the data features. They are incorporated into regression analysis because they offer a non-linear approach to solving regression problems, allowing them to capture complex patterns that linear models may miss.
Advantages of Decision Trees: interpretability; non-linearity and interaction handling; no need for feature scaling.
Disadvantages of Decision Trees: overfitting; high variability; greedy algorithms.

Random Forest Regression
Random Forest Regression is an ensemble learning algorithm that uses multiple decision trees to make predictions by averaging the outputs of the individual trees. Each tree in the forest is built from a random sample of the data points (a bootstrap sample), and a random subset of the features is considered when splitting nodes.
Advantages of Random Forests: robust to overfitting; handles nonlinear patterns; provides feature importance.
Disadvantages of Random Forests: computational cost; less intuitive interpretation.

Other Non-Linear Regression Models: Bagging Regressor, AdaBoost Regressor, SVR, MLP Regression.

Ensemble Methods
Voting - average the predictions from the base models.
Stacking - build a meta-model that uses the predictions from the base models as features.

Time Series Models
A time series model is a statistical framework that analyses and forecasts sequential data points measured at regular time intervals, capturing patterns and trends to make predictions about future values.
- Data is sequential - collected over time
- Features are the current and past values of the target variable
- Seasonality: a repeating pattern that occurs at regular intervals of time

Univariate vs Multivariate
Univariate: analyses and forecasts a single sequential series measured at regular time intervals, capturing patterns and trends to make predictions about future values. Some univariate models include:
○ NaiveForecaster
○ AutoETS
○ AutoARIMA
○ Prophet
○ ThetaForecaster
Multivariate: analyses and forecasts multiple sequential series measured at regular time intervals, which helps capture patterns and trends to predict future values.

--------------------------------------------------------------------------------

Week 12 - Text and Sentiment Analysis

Sentiment Analysis
Sentiment analysis is concerned with analysing a body of text to understand the opinion expressed by it.

Bag-of-Words Model
The bag-of-words model represents text data as numerical feature vectors. It is a simplifying representation used in natural language processing: a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping the multiplicity (frequency) of the words.
Bag-of-words modelling typically involves the following steps:
- Create a vocabulary of unique tokens - i.e. words - from the entire set of documents
- Construct a feature vector for each document that contains the counts of how often each word occurs in that particular document
Example: "John likes to watch movies. Mary likes movies too."
Based on this document, a list is constructed as follows:
"John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"
We obtain a bag-of-words, represented here as a dictionary:
BoW = {"John":1, "likes":2, "to":1, "watch":1, "movies":2, "Mary":1, "too":1}
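A minimal sketch reproducing the bag-of-words counts for the example sentence with scikit-learn's CountVectorizer (which lowercases tokens and orders the vocabulary alphabetically).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes movies too."]

vect = CountVectorizer()            # builds the vocabulary of unique tokens
bow = vect.fit_transform(docs)      # counts of each token per document

print(vect.get_feature_names_out())  # ['john' 'likes' 'mary' 'movies' 'to' 'too' 'watch']
print(bow.toarray())                 # [[1 2 1 2 1 1 1]]
```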
N-Gram Models
An n-gram is a sequence of n items from a given sample of text or speech.
- 1-gram model: each item (token) in the vocabulary represents a single word
- 2-gram model: each item represents a sequence of two words
Example: "the sun is shining"
1-gram: 'the', 'sun', 'is', 'shining'
2-gram: 'the sun', 'sun is', 'is shining'

Regular Expressions
Regular expressions are sequences of characters that define a search pattern.

Word Stemming
Word stemming - transforming words into their root forms.

Topic Modelling
Topic modelling - assigning topics to unlabelled text documents.
- A type of clustering task (unsupervised learning)
- E.g. categorisation of documents in a large text corpus of news articles
○ Aim to assign category labels to those articles, e.g. sports, finance, world news, politics, etc.
Latent Dirichlet Allocation (LDA) - a technique that finds groups of words that appear frequently together across different documents.
Using a bag-of-words matrix as its input, LDA produces:
- A document-to-topic matrix: a matrix that represents the distribution of topics in a collection of documents. Each row in the matrix corresponds to a document, and each column corresponds to a topic.
- A word-to-topic matrix: a matrix that represents the distribution of words in a vocabulary across a set of topics, where each row corresponds to a word and each column corresponds to a topic, with cell values indicating the importance or weight of each word in each topic.
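A minimal sketch of Latent Dirichlet Allocation on a tiny hypothetical corpus; the documents and the choice of two topics are purely illustrative. Note that scikit-learn's components_ attribute is a topic-to-word weight matrix, i.e. the transpose of the word-to-topic matrix described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the match ended with a late goal",
        "the team won the cup final",
        "shares fell as the market closed",
        "investors sold stocks after the report"]

bow = CountVectorizer(stop_words='english').fit_transform(docs)  # bag-of-words input

lda = LatentDirichletAllocation(n_components=2, random_state=1)
doc_topic = lda.fit_transform(bow)   # document-to-topic matrix (n_docs x n_topics)
topic_word = lda.components_         # topic-to-word weights (n_topics x n_words)

print(doc_topic.shape, topic_word.shape)
```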
