Machine Learning in Finance Summary PDF
Universität St. Gallen (HSG)
Summary
This document provides a summary of machine learning in finance, discussing big data concepts, definitions of machine learning, and its applications in various financial sectors. It highlights the use of machine learning for tasks such as fraud detection and stock price prediction.
Full Transcript
Machine learning in finance summary

1.1 – Basics of machine learning

Big data from a statistical perspective:
- n: number of observations (the number of rows in a spreadsheet; the number of data points in the dataset). Big n = large sample size.
- p: number of features, i.e. the dimensionality (the columns/predictors available for predicting another variable). Big p = high dimensionality.
- What a method needs in order to address large sample size or high dimensionality:
  - Big n: a large sample size requires machine learning techniques that process data efficiently, e.g. low-complexity algorithms or distributed computing (making multiple computers work together on a common problem).
  - Big p: high-dimensional statistics requires assuming some special structure in the data and innovation in statistical procedures.

Machine learning essentials and applications in finance

Machine learning definition:
- ML uses past data to learn and make predictions on unseen data.
- ML is AI, but not all AI is ML: ML is a subfield of AI.
- Definition: a process of extracting patterns and features from data and making useful predictions based on them, with the emphasis on prediction accuracy (ML aims to predict rather than explain).
- The data are assumed to be generated by a random process, hence any conclusion drawn is probabilistic (in ML we do not know the true underlying data-generating process).

Making better predictions using ML:
a. A larger amount of data, or better-quality data, leads to more accurate predictions.
b. Alternatively, we can use better learning algorithms (better models).

How can ML be used in finance?
Here we consider finance in its broader sense: the financial sector is the section of the economy made up of firms and institutions that provide financial services, and it comprises a broad range of industries including banks, development banks, investment companies, insurance companies, real estate firms, etc. Some examples of the use of machine learning in finance:
- Fraud detection in loan applications.
- Facilitation of banks' underwriting process (the assessment and evaluation of the risk associated with lending money or extending credit to a customer): by training algorithms on large amounts of customer data, the system can make quick underwriting and credit-scoring decisions.
- Stock price prediction based on quarterly revenues, business news, etc.
- Real estate price prediction based on economic and demographic indicators, etc.

Statistical learning

Statistical learning notation:
- Y: dependent variable, response variable, output.
- X = (X1, X2, ..., Xp): regressors, covariates, features, independent variables, inputs.
Suppose that the output variable Y depends on the input variable X and that we can model the relationship as
Y = f(X) + ε (1)
where f is an unknown function and ε captures measurement error (randomness) with mean zero.
Definition: we call statistical learning the task of estimating f from given data, which we call training data and which here take the form (X1, Y1), (X2, Y2), ..., (Xn, Yn).
Why estimate f? For inference or prediction (here we care about prediction).
- If, in practice, the output Y is not easily obtained but the input X is, then it is desirable to be able to predict what the output Y will be for a given value of X.
- Such a prediction has the form Ŷ = f̂(X), where f̂ is an estimate of the unknown function f.
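To make the setup concrete, here is a minimal Python sketch (not from the lecture notes): we simulate data from a known f, estimate f from training data, and predict at new inputs. The choice of f, the simulated data and the use of scikit-learn's LinearRegression are illustrative assumptions.

```python
# Minimal sketch of the statistical-learning setup Y = f(X) + eps:
# simulate data, estimate f from training data, predict at new X.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def f(x):                                  # the "unknown" function (known here only because we simulate)
    return 2.0 + 3.0 * x

n = 100
X = rng.uniform(0, 10, size=(n, 1))        # inputs (features)
eps = rng.normal(0, 1.0, size=n)           # noise with mean zero
Y = f(X[:, 0]) + eps                       # observed responses

f_hat = LinearRegression().fit(X, Y)       # estimate f from the training data

X_new = np.array([[2.5], [7.0]])           # unseen inputs
Y_hat = f_hat.predict(X_new)               # predictions  Y_hat = f_hat(X_new)
print(Y_hat)
```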
Reducible and irreducible error:
- For a fixed value of X, our overall error can be decomposed into two parts: the irreducible error (even if we knew f, we would still make prediction errors because of ε) and the reducible error.
- We focus on minimizing the reducible error, since the irreducible error is beyond our control once we have chosen our predictors.

Categories of learning

Categories of learning: supervised vs. unsupervised learning.

Supervised learning:
1. Regression: Y is continuous/numerical (e.g. a stock price).
2. Classification: Y is categorical (e.g. an insurance claim, acceptance of a loan or not).
- Some methods work well on both types, e.g. neural networks.
- Other methods work best on regression tasks (e.g. linear regression) or on classification tasks (e.g. linear discriminant analysis).
Supervised learning is all about how to estimate f. Two reasons for estimating f:
a. Prediction (predictive accuracy: forecasting outcomes accurately).
b. Inference (estimation accuracy and uncertainty). Inference can mean understanding causal relationships (interpreting and explaining the model's behaviour and the underlying patterns in the data).
- Supervised learning for prediction is the focus of this course.
In short: supervised learning trains a model on input objects together with their desired output values. In the optimal scenario the algorithm correctly determines the output values for unseen instances, i.e. for any input we know the output we should produce.

Unsupervised learning:
- There is no outcome variable, just a set of covariates measured on a set of samples (we only have X).
- The goal is to extract useful summarizing features or patterns from the data.
- With these methods it is more difficult to know how well you are doing.
- Example of a method: clustering, i.e. finding groups of samples that behave similarly.
- Example of an application: market segmentation, where we try to divide potential customers into groups based on their characteristics.

Measuring the success/performance of a machine learning method

Success = performance = is the method good?
- For supervised learning: compare the prediction with the true value via a loss function, which quantifies how far the prediction is from the true value.
- For unsupervised learning: assess whether useful information can be extracted from the data.

Loss function:
Definition: a loss function quantifies how far your predicted value is from the true value.
- The loss function directly encodes our knowledge about a given problem and the goal we set for the learning machine. Its definition therefore depends on whether we are in a classification or a regression setting. In all cases, a loss function must obey a few basic rules:
i) it should compare a label with a predicted label and output a non-negative real number;
ii) it must output zero when the two labels are equal.
- For regression, we care about the squared difference between the actual and the predicted values.
- For classification, the typical loss function is the 0-1 loss, which simply assigns a cost of 1 to misclassifications and 0 to correct predictions. Given that loss function, the risk of a classifier corresponds to its probability of misclassifying a random pattern.

Risk: the risk of a learning machine or model is the amount of prediction error that can be expected when using that model. The main goal in supervised learning is to minimize this risk in order to obtain the most accurate model.
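The two loss functions mentioned above can be written in a few lines of Python. This is an illustrative sketch (numpy and the toy arrays are assumptions), not part of the original notes.

```python
# Squared loss for regression and 0-1 loss for classification.
import numpy as np

def squared_loss(y_true, y_pred):
    """Regression loss: (y - y_hat)^2 for each observation."""
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

def zero_one_loss(y_true, y_pred):
    """Classification loss: 1 for a misclassification, 0 for a correct prediction."""
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(float)

# Both losses are non-negative and equal zero when prediction and label agree.
y_reg, y_reg_hat = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.0])
y_cls, y_cls_hat = np.array([0, 1, 1, 0]),    np.array([0, 1, 0, 0])

print(squared_loss(y_reg, y_reg_hat).mean())   # average squared loss (MSE on this sample)
print(zero_one_loss(y_cls, y_cls_hat).mean())  # misclassification error rate
```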
The risk is not a quantity that can be computed exactly, since it measures the quality of predictions of unknown phenomena, and in practice it must be estimated. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we do not know the true distribution of the data the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).
- The true risk cannot be known.
- The empirical risk can be measured.
We use a loss function L to quantify prediction accuracy, and we consider mainly two types of loss functions (squared loss and 0-1 loss). For a given loss function L, the risk of a learning function f is defined as the expected loss
R(f) = E[L(Y, f(X))].
We aim to find the f(x) that minimizes R(f) pointwise (for each point of a given set).

Measuring the quality of fit to the data:
Suppose we observe i.i.d. training data (Xi, Yi), i = 1, ..., n. The empirical risk of any function f is
Rn(f) = (1/n) Σ_{i=1}^{n} L(Yi, f(Xi)).
We can find the empirical risk minimiser f̂ which minimises Rn(f). The training error of the empirical risk minimiser is Rn(f̂).
- Regression: the mean squared error (MSE) is given by
MSE = (1/n) Σ_{i=1}^{n} (Yi − f̂(Xi))².
- Classification: the misclassification error rate is given by
MER = (1/n) Σ_{i=1}^{n} 1{Yi ≠ f̂(Xi)}.
Our method is designed to make the MSE or the MER small on the training data we are looking at: if the prediction error is small, accuracy is high. But the training error is not all we care about, because the model could be overfitted and would then not work on new data; we still want low errors on new data.

Test errors:
What we really care about is how well the method works on new data (X̃i, Ỹi), i ∈ {1, ..., m}. We call this new data test data. We aim to choose the method that gives the lowest test MSE or MER for regression and classification problems, respectively. Importantly, f̂ is independent of the test data, so the test MSE is a more accurate approximation of the true risk of f̂. Test accuracy tells us how well the model generalizes in general, not just on the one dataset that we have.

Flexibility vs. interpretability:
For more complex or flexible models, we need to be especially aware of the distinction between training and test errors.
- Flexibility: how well a model fits the pattern (e.g. a curve). A flexible model can predict with a small prediction error but cannot really explain; the higher the flexibility, the less interpretable the model.
- Interpretability: an interpretable model cannot be expected to catch all patterns in the data, but it can explain; the higher the interpretability, the less flexible the model.

Training vs. testing errors:
In general, the more flexible a method is, the lower its training error rate will be, i.e. it will "fit" or explain the training data very well. However, the test error rate may in fact be higher for a more flexible method than for a simple approach like linear regression.
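The training-vs-test gap just described can be illustrated with a short simulation (not from the lecture notes): a very flexible fit typically attains a much lower training MSE but a larger test MSE than a simple linear fit. The true function, the polynomial degrees and the sample sizes are illustrative assumptions.

```python
# Training vs. test MSE for a simple and a very flexible model on simulated data.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)   # Y = f(X) + eps
    return x, y

x_train, y_train = simulate(20)
x_test,  y_test  = simulate(1000)

for degree in (1, 15):                       # simple linear fit vs. very flexible fit
    p = Polynomial.fit(x_train, y_train, degree)
    mse_train = np.mean((y_train - p(x_train)) ** 2)
    mse_test  = np.mean((y_test  - p(x_test))  ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```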
Bias-Variance Tradeoff

- Bias is the error introduced by the simplifying assumptions made by our method to make the target function easier to approximate.
- Variance is the amount by which the estimate of the target function would change if the model were trained on different training data.
- Test MSE (ChatGPT): how far the model's predictions are from the true values on new data.
- Low variance: every time the model is trained on a new sample, it gives roughly the same answer.
- A more flexible/complicated model is not always better!

As the flexibility of the model f̂ increases (ChatGPT):
- The bias decreases: more flexible models can capture more complexity in the data, meaning they are better at approximating the true function f(X).
- The variance increases: however, more flexible models also become more sensitive to small changes in the training data, causing higher variance.
This is the essence of the bias-variance tradeoff:
- More flexible models: low bias, high variance (overfitting).
- Simpler models: high bias, low variance (underfitting).
We want low bias (the model's predictions are on average close to the true function f(X)) and low variance (the model generalizes well to new, unseen data and does not fluctuate too much with the training set). The bias-variance tradeoff is about finding the right balance between simplicity and complexity in the model: low bias means that the model's predictions are on average close to the true function, and low variance means that the model does not change too much when we use different training data.

1.2 – Linear Models

Regression analysis characteristics

Regression analysis:
- It is the study of dependence.
- It is a method to quantify the relationship between a variable of interest and explanatory variables.
- We want to model a real data set using as simple a model as possible, while not losing explanatory power on any complex phenomenon observed in the data.

Regression analysis applicability:
Regression analysis has proved very useful in many fields, with very wide applicability:
- Finance professionals may want to determine the systematic risk of a particular stock, referred to as its beta. A stock's beta is a measure of the volatility of the stock compared to a benchmark such as the S&P 500 index.
- Economists analyse how an economic indicator, like GDP, is related to other macroeconomic variables such as industrial production, a housing index, the population of a city, etc.
- Actuaries investigate how the number and amount of an individual's car insurance claims relate to certain characteristics of the car and the policyholder, so that the premium can be set accordingly for different people.
Regression analysis includes linear regression/modelling, which forms the basis of this class.

Simple linear regression

A simple linear model is a special case of a linear model in which there is only one explanatory/predictor variable. The data come in pairs (x, Y), where x represents the covariate and Y the response variable. We assume Y is a continuous random variable and that the covariate x is fixed and known. We use n to denote the sample size, so the data consist of n pairs (x1, y1), ..., (xn, yn). We assume a linear model for a generic pair (x, Y) and for our data (xi, yi):
Yi = β0 + β1 xi + εi,  i = 1, ..., n,
where we assume that εi has mean 0 and variance σ², and that the εi's are independent of each other. The term εi represents measurement/individual error, or unexplained variation. Note that E(Y) = β0 + β1 x and hence β1 = dE(Y)/dx; the interpretation of β1 is therefore the average change in the response Y for a unit change in x.
The assumptions of this model imply:
1) The error can be positive or negative, since εi has mean 0.
2) Group mean: for any data with covariate x, the mean response is E(Y) = β0 + β1 x.
3) Group variance: var(Y) = σ², independent of the value of x.
4) The error is assumed to be additive on the group mean.

Least squares estimator / ordinary least squares (OLS) estimator:
We want to estimate β0 and β1 from the data, and also estimate σ², the residual variance.
To derive the least squares estimator from the data, we minimise the sum of squares of the residuals,
g(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)².
To find the β̂0 and β̂1 that minimise the function g, we differentiate with respect to β0 and β1 and set the derivatives to zero. Solving, we get
β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1 x̄,
where Sxx = Σ (xi − x̄)² is the sum of the squared differences between each input xi and the mean x̄, and Sxy = Σ (xi − x̄)(yi − ȳ) is the sum of the products of the differences between each xi and its mean and each yi and its mean.
To estimate σ², note that σ² is the variance of the errors εi, hence we can estimate it from the residuals by
σ̂² = (1/n) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)².
It can be proven that σ̂² is biased.

Interpretation of the coefficients:
- The intercept β̂0 is the mean value of the dependent variable when the independent variable takes the value 0.
- Its estimate is of no interest when evaluating whether there is a linear relationship between two variables.
- It is, however, of interest if you want to know what the mean value of the output would be when the input equals zero.
- The slope β̂1 corresponds to the expected variation of the output when the input varies by one unit.
- It tells us about the sign of the relationship between the input and the output and the speed of evolution of the output as a function of the input. The larger the slope in absolute value, the larger the expected variation of the output for each unit change of the input.
- Note, however, that a large value does not necessarily mean that the relationship is statistically significant.
We have not assumed that the error ε is normally distributed. Assuming so allows us to make inference on the parameters, such as testing their significance; we can then also construct confidence intervals and prediction intervals for new predictions. In this class, we use linear regression for solely predictive purposes.

Now we look at how a linear model explains the variation in the data.

Decomposition of the total variation of the data:
The spread/deviation of each data point can be measured by (yi − ȳ)², since without knowledge of xi the best estimate at any x is ȳ. We square the deviations so that they do not cancel one another when added up over the data. Hence the total deviation, or Total Sum of Squares for the data (a measure of the total variation in the data), is
Total SS = Σ_{i=1}^{n} (yi − ȳ)².
With knowledge of the xi's we estimate yi by ŷi = β̂0 + β̂1 xi. If this regression line is useful in predicting the yi's, then intuitively the deviation yi − ŷi should be small, while ŷi − ȳ should be close to the original deviation. To determine how useful the regression line is, we decompose
yi − ȳ = (yi − ŷi) + (ŷi − ȳ).
A good regression line should have a "small" unexplained deviation for each data point. Squaring both sides and summing over all data points, we have
Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)².
In words, we decompose the total variation as
Total SS = SS(reg) + RSS,
where SS(reg) is the sum of squares due to regression and RSS is the residual sum of squares. This decomposition leads us to define the coefficient of determination, or "R-squared", by
R² = SS(reg) / Total SS = 1 − RSS / Total SS.
By the above decomposition, 0 ≤ R² ≤ 1. If the regression is a very good one, so that RSS is small, then SS(reg) is close to the Total SS and R² ≈ 1. On the other hand, a bad regression gives a large RSS, so that R² ≈ 0. (ChatGPT: a good regression means that the model effectively explains the relationship between the dependent (response) variable Y and the independent (explanatory) variable(s) X.)
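The OLS formulas and the R² decomposition above can be computed by hand in a few lines. This is an illustrative sketch on simulated data (the true coefficients and sample are assumptions), not part of the original notes.

```python
# Simple linear regression "by hand": beta1_hat = Sxy/Sxx, beta0_hat = ybar - beta1_hat*xbar,
# plus the (biased) variance estimate and R^2 from Total SS = SS(reg) + RSS.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, 50)   # true beta0 = 1.5, beta1 = 0.8 (illustrative)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

beta1_hat = Sxy / Sxx
beta0_hat = ybar - beta1_hat * xbar
y_hat = beta0_hat + beta1_hat * x

rss      = np.sum((y - y_hat) ** 2)          # residual sum of squares
total_ss = np.sum((y - ybar) ** 2)           # total sum of squares
ss_reg   = np.sum((y_hat - ybar) ** 2)       # sum of squares due to regression
r_squared  = 1 - rss / total_ss              # equivalently ss_reg / total_ss
sigma2_hat = rss / len(x)                    # (biased) estimate of the error variance

print(beta0_hat, beta1_hat, r_squared, sigma2_hat)
```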
Multiple linear regression

While simple linear regression has only one covariate, multiple linear regression has two or more covariates. Suppose we have n observations, and the i-th data point is (xi1, ..., xik, yi). The model is then
Yi = β0 + β1 xi1 + β2 xi2 + ... + βk xik + εi,  i = 1, ..., n,
and we want to minimise
Σ_{i=1}^{n} (yi − β0 − β1 xi1 − ... − βk xik)².
It becomes very cumbersome to write the model and carry out the analysis this way, hence you will often see this model written in matrix form.

1.3 – Resampling Approaches

General information

Resampling approaches, definition: tools that involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain more information about the fitted model.
- Model assessment: estimate test error rates.
- Model selection: select the appropriate level of model flexibility.
- Model aggregation: reduce variance and improve accuracy.

Example of how a resampling approach could work: in order to estimate the variability of a linear regression fit, we can
- repeatedly draw different samples from the training data,
- fit a linear regression to each new sample,
- then examine the extent to which the resulting fits differ.
The benefit: such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample. Resampling approaches are computationally expensive, but computers can help us.

Two famous resampling approaches:
- Cross-validation.
- The bootstrap.
Resampling methods can give us a more informed view of the prediction error / complexity tradeoff.
- Bias measures the difference between the model's predictions and the target value.
- Variance measures the inconsistency of the predictions across varied datasets.

Ways to estimate the test error:
- The best solution would be a large, designated test set, which is usually not available.
- Mathematical adjustments to the training error rate that estimate the test error rate, e.g. the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which provide measures of model performance that account for model complexity.
- Resampling approaches: estimate the test error by holding out a subset of the training observations and applying the statistical learning method to those held-out observations.

Cross-validation

Using a validation set to estimate the prediction risk

The validation set approach: suppose that we would like to find the set of variables that gives the lowest test error rate. Given a large data set, we can randomly split the data into a training set and a validation set (or hold-out set), usually roughly half-half. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate, typically assessed using the MSE, provides an estimate of the test error rate.

Example (Auto data): suppose that we want to predict mpg from horsepower. Two models:
1. mpg ∼ horsepower
2. mpg ∼ horsepower + horsepower²
Which model gives a better fit? Randomly split the Auto data into training data (196 obs.) and validation data (196 obs.); the dataset has 392 obs. in total. Fit both models using the training data set, then evaluate both models using the validation data set. The model with the lowest validation (test) MSE is the winner!
Results (figure): left, the validation error rate for a single split; right, the validation method repeated 10 times, each time with a random split. What do we observe in the plots? There is large variability among the MSEs.
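A sketch of the validation set approach for these two candidate models is shown below. It is not from the lecture: the Auto data are replaced by simulated stand-ins for mpg and horsepower, and scikit-learn is an assumed tool.

```python
# Validation set approach: random half/half split, fit mpg ~ horsepower and
# mpg ~ horsepower + horsepower^2 on the training half, compare validation MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
horsepower = rng.uniform(50, 230, 392)
mpg = 60 - 0.5 * horsepower + 0.001 * horsepower**2 + rng.normal(0, 4, 392)  # illustrative data

hp_tr, hp_val, mpg_tr, mpg_val = train_test_split(horsepower, mpg, test_size=0.5, random_state=0)

def validation_mse(degree):
    """Fit a polynomial-in-horsepower model on the training half, score on the validation half."""
    X_tr  = np.column_stack([hp_tr  ** d for d in range(1, degree + 1)])
    X_val = np.column_stack([hp_val ** d for d in range(1, degree + 1)])
    model = LinearRegression().fit(X_tr, mpg_tr)
    return np.mean((mpg_val - model.predict(X_val)) ** 2)

print("linear   :", validation_mse(1))
print("quadratic:", validation_mse(2))   # the model with the lower validation MSE wins
```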
Advantages and disadvantages of the validation set approach:
Advantages:
- Simple.
- Easy to implement.
Disadvantages:
- The validation MSE can be highly variable.
- Only a subset of the observations is used to fit the model (the training data). Statistical methods tend to perform worse when trained on fewer observations, which suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Leave-one-out cross-validation (LOOCV)

For each suggested model, we do the following:
- Split the data set of size n into a training data set of size n − 1 and a validation data set of size 1.
- Fit the model using the training data.
- Validate the model on the single held-out observation.
- Repeat the process n times.
The model MSE is
CV(n) = (1/n) Σ_{i=1}^{n} (yi − ŷ(−i))²,
where ŷ(−i) is the prediction for observation i from the model fitted without observation i.

Comparison with the validation set approach:
- LOOCV has less bias:
a. we repeatedly fit the statistical learning method using training data that contain n − 1 observations;
b. LOOCV tends not to overestimate the test error rate.
- LOOCV is computationally intensive (a disadvantage): we fit each model n times.
- LOOCV produces a less variable MSE. The validation approach produces a different MSE each time it is applied due to the randomness of the splitting process, while performing LOOCV multiple times will always yield the same results. (Why? Because every row is used exactly once as a single-observation test set, with all the other rows used for model building, so there is no randomness in the row assignment.)

K-fold cross-validation

LOOCV is computationally intensive, so we often run K-fold CV instead. With K-fold CV, we divide the data set into K parts (e.g. K = 5 or 10 as a rule of thumb), C1, ..., CK, where Ck denotes the indices of the observations in part k and there are nk observations in part k. We compute
CV(K) = Σ_{k=1}^{K} (nk / n) · MSEk, with MSEk = (1/nk) Σ_{i∈Ck} (yi − ŷi)²,
where ŷi is the fit for observation i obtained from the data with part k removed.

Link between LOOCV and K-fold CV: LOOCV is a special case of K-fold CV with K = n. Both are stable, but LOOCV is more computationally intensive. K-fold CV is also more stable than the validation set approach.

Which CV approach is preferred?
Putting aside that LOOCV is more computationally intensive than K-fold CV, which one is better?
- Since each training set is only (K − 1)/K as big as the original training data, the estimates of the prediction error will typically be biased upwards.
- LOOCV has higher variance than K-fold CV.
- There is therefore a trade-off between the two.
Conclusion:
- As a compromise, we use K-fold CV with K = 5 or 10.
- It has been shown empirically that these choices yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
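A minimal K-fold CV sketch mirroring the weighted-average formula above is given below; it is not from the lecture. The data are simulated stand-ins and scikit-learn's KFold is an assumed convenience.

```python
# 5-fold cross-validation for polynomial models, weighting each fold by n_k / n.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
x = rng.uniform(50, 230, 392)
y = 60 - 0.5 * x + 0.001 * x**2 + rng.normal(0, 4, 392)   # illustrative data

def cv_mse(degree, K=5):
    X = np.column_stack([x ** d for d in range(1, degree + 1)])
    folds, n, total = KFold(n_splits=K, shuffle=True, random_state=0), len(y), 0.0
    for train_idx, test_idx in folds.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        mse_k = np.mean((y[test_idx] - model.predict(X[test_idx])) ** 2)
        total += len(test_idx) / n * mse_k      # weight each fold by n_k / n
    return total

for d in (1, 2, 5):
    print(f"degree {d}: 5-fold CV estimate of test MSE = {cv_mse(d):.2f}")
```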
Bootstrap for uncertainty quantification

The bootstrap method is a resampling technique used to estimate statistics of a population by sampling a dataset with replacement. It can be used to estimate summary statistics such as the mean or the standard deviation.
- The bootstrap can quantify the uncertainty associated with a given estimator or statistical learning method.
- It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.
- A desirable property of the results from estimating machine learning model skill is that the estimated skill can be presented with confidence intervals, a feature not readily available with other methods such as cross-validation.
- The bootstrap can easily be applied to a wide range of approaches.

A bootstrap approach: rather than repeatedly obtaining independent data sets from the (unknown) population, we obtain distinct data sets by repeatedly sampling observations from the original data set. A simple example uses 3 observations, with sampling performed with replacement.

Bootstrap procedure: the sampling-with-replacement procedure is repeated B times to produce B different bootstrap data sets Z*1, Z*2, ..., Z*B and B corresponding estimates α̂*1, α̂*2, ..., α̂*B of a quantity α. We can compute the standard error of the bootstrap estimates as the sample standard deviation of the bootstrap estimates:
SE_B(α̂) = sqrt( (1/(B − 1)) Σ_{b=1}^{B} (α̂*b − ᾱ*)² ), where ᾱ* = (1/B) Σ_{b=1}^{B} α̂*b.
The standard error of the mean, or simply standard error, indicates how different the population mean is likely to be from a sample mean. It tells you how much the sample mean would vary if you were to repeat a study using new samples from within a single population. The standard error of the bootstrap samples is an estimate of the standard deviation of the sampling distribution of the mean.

Other uses of the bootstrap:
- Mainly to obtain the standard error of an estimate.
- It provides approximate confidence intervals for a population parameter. E.g., revisiting the histogram of 1000 bootstrap values, the 5% and 95% quantiles are (0.43, 0.72); this provides an approximate 90% confidence interval for α, the bootstrap percentile confidence interval.
- It can also be used to improve an estimate by aggregating over multiple bootstrap estimates. This is called bootstrap aggregation, or bagging.

Estimating prediction error with the bootstrap: one approach is to fit the model on a set of bootstrap samples and keep track of how well it predicts the original training set. Let f̂*b(xi) be the predicted value at xi from the model fitted to the b-th bootstrap sample; our estimate is
(1/B) (1/n) Σ_{b=1}^{B} Σ_{i=1}^{n} L(yi, f̂*b(xi)).
Each bootstrap sample has significant overlap with the original data: about 2/3 of the original data points appear in each bootstrap sample. (Why?) This causes the bootstrap to seriously underestimate the true prediction error.

Cross-validation vs. bootstrapping:
- Both are resampling methods.
- Cross-validation subsamples without replacement; bootstrapping samples n points with replacement.
- Both quantify the error/uncertainty of a procedure.
- Cross-validation uses data not in the subsample to estimate the test error; it cannot be used to estimate the error of a parameter estimate.
- Bootstrapping uses the same data to build the model and estimate the parameter; it measures the variance of the estimator.
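The bootstrap standard-error recipe described above fits in a few lines of Python (an illustrative sketch, not from the lecture; the data, the statistic chosen here (the sample mean) and B are assumptions).

```python
# Bootstrap standard error and percentile confidence interval for a statistic.
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=0.6, scale=0.2, size=100)   # original sample (illustrative)

B = 1000
boot_estimates = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=len(data), replace=True)  # bootstrap data set Z*_b
    boot_estimates[b] = sample.mean()                        # alpha_hat*_b

se_boot = boot_estimates.std(ddof=1)               # bootstrap standard error
ci_90 = np.quantile(boot_estimates, [0.05, 0.95])  # bootstrap percentile 90% CI
print(se_boot, ci_90)
```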
2.1 – Decision Trees

Classification tasks try to predict categories/classes.

A simple decision-making scenario:
- Imagine you are a portfolio manager tasked with deciding whether to invest in a particular stock. What factors would you consider in such a financial decision? We would use many different types of information as variables/inputs for our model, and the decision has to be data-driven.
- You have various factors at your disposal: the company's profitability, its recent performance, economic trends, interest rates, and perhaps even global news.
- How do you decide which factors are most important to focus on, and what combination of them should lead to your final decision?
- Decision-making in finance is not always straightforward and often involves complex, data-driven decisions.

How can a decision tree help?
- A decision tree is like a flowchart for decision-making.
- It helps you break down complex decisions into smaller, more manageable choices.
- Each 'branch' of the tree represents a question or a condition, like 'Are the stock's quarterly earnings growing? Yes or no?'
- Depending on your answer, you move to the next branch, asking another question until you reach a decision: buy, sell, or hold.
- Decision trees mimic human decision-making, but in a structured, data-driven way.

Advantages and limitations:
Advantages:
- Easy to understand and interpret.
- Non-linear relationships: can capture more complex patterns than linear models.
- Useful for feature importance: identifying which factors (features) matter most in making decisions.
Limitations:
- Overfitting: decision trees can get too specific to the training data (low bias).
- Instability: small changes in the data can result in large changes in the tree structure (high variance).

Importance of decision trees in ML: decision trees are foundational models for more advanced machine learning techniques, which combine multiple decision trees to improve accuracy and performance.

Tree-based methods:
- We focus on tree-based methods for regression and classification.
- These involve stratifying or segmenting the predictor space into a number of simple regions.
- Since the set of splitting rules used to segment the predictor space can be summarised in a tree, these approaches are known as decision tree methods.
- The idea is to find a way to divide the input space into simpler regions: we have all the information and divide it so that we answer one question at a time (stratifying).

Details/structure:
- In keeping with the tree analogy, the regions R1, R2 and R3 are known as terminal nodes or leaves of the tree.
- Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
- The points along the tree where the predictor space is split are referred to as internal nodes.
- In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
- We refer to the segments of the tree that connect the nodes as branches.
- Decision trees have an advantage in display and interpretation (compared to a regression model).

Details of the tree-building process:
- We divide the predictor/covariate space, i.e. the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
- For every observation that falls into region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
- In theory, the regions could have any shape. Decision trees divide the predictor space into axis-aligned high-dimensional rectangles for simplicity and interpretability.
- Our target is to find the rectangles R1, R2, ..., RJ that minimise the RSS, i.e. to minimise the prediction error at the end.
- We test prediction accuracy after the decision tree has been built, so we need a test dataset. A new observation is dropped down the tree and falls into, say, R1 because it has similar characteristics to that region; its prediction is the mean of all training points that fall into this region.

More details of the tree-building process:
- It is computationally infeasible to consider every possible partition of the feature space into J rectangles. We take a top-down, greedy approach known as recursive binary splitting.
- Top-down: it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
- Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
- We first select the predictor Xj and a cutpoint s such that splitting the predictor space into the regions {X : Xj < s} and {X : Xj ≥ s} leads to the greatest possible reduction in RSS.
Then we repeat the process, searching for the best predictor and cutpoint with which to further split the data so as to minimise the RSS within each of the resulting regions. Note that rather than splitting the entire space, we split one of the two previously identified regions; we then have three regions. Again, we look to split one of these three regions further, so as to minimise the RSS. The process continues until a stopping criterion is reached, e.g. we may continue until no region contains more than five observations.

Splitting the predictor space: generally we create the partitions by iteratively splitting one of the X variables into two regions. For example:
1. First split on X1 = t1.
2. If X1 < t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.

Estimator of regression trees: we want a low training error in order to make accurate predictions.

Pruning a tree:
- The tree-building process may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance (partly because of the greediness of the splitting).
- To fix this, we can first grow a very big tree T0 and then prune it back to obtain a subtree: we cut the unnecessary leaves that do not help the prediction (they have to be chosen wisely, i.e. those that do not improve prediction accuracy), and in this way we can avoid overfitting.
- Cost complexity pruning tells us how much we need to cut: it trades off the size of the tree against the error it produces. For a given tuning parameter α, the subtree Tα ⊆ T0 is the one that minimises
Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷRm)² + α|T|,
where |T| is the number of terminal nodes of the subtree T and ŷRm is the mean response in region Rm. We want a small prediction error for a tree that is not too large.

Choosing the optimal subtree: α is the tuning parameter that controls the trade-off between the subtree's complexity (variance) and its fit to the training data (bias). The optimal α can be obtained using cross-validation. We then return to the full dataset and obtain the subtree corresponding to the optimal α.

Algorithm for building a regression tree:
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
a. repeat Steps 1 and 2 on all but the k-th fold of the training data;
b. evaluate the mean squared prediction error on the data in the left-out k-th fold, as a function of α;
c. average the results for each value of α and pick the α that minimizes the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
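A sketch of this grow-prune-cross-validate recipe using scikit-learn is shown below; it is not from the lecture, and the simulated data and hyperparameter choices are assumptions. scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter.

```python
# Grow a large regression tree, get the cost-complexity pruning path,
# choose alpha by 5-fold CV, then refit the pruned tree on the full data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 0) + rng.normal(0, 0.3, 300)   # illustrative data

# Step 1: grow a large tree (small minimum node size).
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# Step 2: sequence of subtrees indexed by alpha (cost-complexity pruning path).
ccp_alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Step 3: choose alpha by 5-fold cross-validation (negative MSE, so maximize).
cv_scores = [
    cross_val_score(DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=0),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]

# Step 4: refit the pruned subtree on the full dataset with the chosen alpha.
pruned_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha,
                                    random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```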
Classification trees:
- For a classification tree, we predict that each observation belongs to the most commonly occurring class among the training observations in the region to which it belongs.
- We use recursive binary splitting to grow a classification tree.
- Recall that in a regression tree we defined the cost complexity criterion using the MSE; in the classification setting we need to substitute the MSE loss by an alternative measure.
- Misclassification error rate: the fraction of training observations in the region that do not belong to the most common class,
E = 1 − max_k (p̂mk),
where p̂mk is the proportion of training observations in region m that belong to class k.
- Two other measures of loss: the classification error is not sufficiently sensitive for tree growing, and in practice two other measures are preferable.
- Gini index: a measure of total variance across the K classes,
G = Σ_{k=1}^{K} p̂mk (1 − p̂mk).
The Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
- Cross-entropy (deviance):
D = − Σ_{k=1}^{K} p̂mk log p̂mk.
These two measures are used to evaluate the quality of a particular split, since they are more sensitive to node purity than the classification error rate.

Advantages and disadvantages of trees:
- Trees are easy to explain to people; they are even easier to explain than linear regression.
- Some people believe that decision trees mirror human decision-making more closely than the regression and classification approaches seen in previous lectures.
- Trees can be displayed graphically and easily interpreted even by a non-expert.
- Trees can easily handle qualitative predictors without the need to create dummy variables.
- However, trees do not have the same level of predictive accuracy as some other regression and classification approaches.

2.2 – Forests

Bootstrap aggregation:
The bootstrap is an extremely powerful idea, but what does it have to do with statistical learning? Suppose that we have a procedure (such as trees, neural nets, etc.) that has high variance. An ideal way to reduce the variance would be to take many samples from the population, build a separate prediction model on each sample and average the resulting predictions. Of course, as discussed previously, we cannot take multiple different samples from the population. However, we can use the bootstrap, which does the next best thing by taking repeated samples from the training data. We therefore end up with B different training data sets; we train our method on each data set and then average all the predictions:
f̂avg(x) = (1/B) Σ_{b=1}^{B} f̂*b(x).
This approach is called bagging, or bootstrap aggregation.

Bagging regression and classification trees:
- Regression trees: we use the average prediction to aggregate the bootstrapped trees.
- Classification trees: for each test observation, we record the class predicted by each of the B trees and take a majority vote; the overall prediction is the most commonly occurring class. Alternatively, we can take the average of the B predicted probabilities and assign the observation to the class with the highest averaged probability.
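A small sketch of bagging for regression trees follows (not from the lecture; the simulated data and B are assumptions, and scikit-learn's BaggingRegressor offers the same idea pre-packaged).

```python
# Bagging by hand: draw B bootstrap samples, fit one tree on each, average predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 300)

B, n = 100, len(y)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)                     # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx]))

def bagged_predict(X_new):
    """Average of the B individual tree predictions: f_avg(x) = (1/B) * sum_b f*_b(x)."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

X_new = np.array([[0.5, -1.0], [2.0, 1.0]])
print(bagged_predict(X_new))
```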
Random forests:
The ensemble method that we use is called the random forest. It was developed by Breiman (2001) and is used to solve prediction problems. The underlying logic:
- Divide and conquer: split the data into multiple bootstrap samples.
- Construct a randomised tree hypothesis on each sample. (ChatGPT: on each of the bootstrap samples, a randomized decision tree is built. The key difference between random forests and traditional decision trees or bagging is that random forests add an extra layer of randomness: at each node in the tree, only a random subset of features is considered when deciding the best split. Randomizing which features are used at each node makes the trees more independent of each other.)
- Average these hypotheses together. (ChatGPT: the random forest averages the predictions of the individual trees to make the final prediction.)
Generally, a random forest can be seen as a successor of bagging when the base learners are decision trees. (ChatGPT: random forests can be seen as an improvement over bagging because they add the extra step of randomizing the features used at each split, which further reduces the correlation between the trees.)
This is because the random forest addresses the main pitfall of bagging: the issue of diminishing variance reductions. (ChatGPT: the problem with bagging is that, while it reduces variance by averaging many trees, the trees tend to become similar (highly correlated), so the variance reduction diminishes as more trees are added. Random forests solve this by making the trees less correlated: by injecting randomness through random subsets of features, the trees become more diverse, which leads to greater variance reduction and better overall performance.) This is achieved by injecting an additional element of randomness during the construction of the decision trees, so that they are less correlated with one another. At the same time, since the base learners are decision trees, there are not many assumptions about the form of the target function, resulting in low bias. (ChatGPT: decision trees are non-parametric models, meaning they do not assume a specific form for the relationship between the features and the target variable and can capture complex patterns in the data. Because of this flexibility, random forests tend to have low bias. The averaging process reduces the variance without significantly increasing the bias, which results in a robust model that generalizes well to unseen data.)

Construction process:
- The first step in the random forest generation process is bootstrap sampling.
- The second stage is regression tree development. From the bootstrap samples, K regression trees are grown (one per sample) using recursive partitioning as in Classification And Regression Trees (CART), but with a smart twist which further randomises the procedure: at each level of the recursive partitioning process, the best predictor for the split is chosen from a fresh, each time random, sub-sample of the full set of predictors, whose size is denoted mtry. The best split is chosen by examining all possible predictors in this sub-sample and all possible cut-points in terms of their ability to minimise the residual sum of squares of the resulting tree.
- A tree stops growing when a minimum number of observations in a given node is reached, but generally speaking the trees comprising the random forest are fully grown and not pruned.
- By constructing these K trees we effectively get K estimators of the function f, namely h1, h2, ..., hK. The average of these individual estimators,
ĥ(xn) = (1/K) Σ_{k=1}^{K} hk(xn),
is the random forest.

Random forests vs. bagging:
- The random forest provides an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
- As in bagging, we build a number of decision trees on bootstrapped training samples. But in a random forest we also randomly sample predictors: each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors, and the split is allowed to use only one of those m predictors.
- Bagging is the special case of the random forest with mtry = p.

Hyperparameters:
From the above process description, it is evident that there are three parameters whose values need to be fixed prior to random forest development: the number of trees grown, the node size, and the number of variables randomly selected at each split.
These control, respectively, the size of the forest, the size of the individual trees and an aspect of the within-tree randomness. Certain default values have been suggested following empirical experiments on various data sets, namely mtry = p/3, K = 500 and nodesize = 5, but one can use a tuning strategy that optimises prediction performance to select the most suitable values for the data set under study.

Making predictions:
- After the random forest is built, it can be used to provide predictions of the response variable. To make predictions, it is necessary to feed the method inputs that have never been seen during the construction process.
- Thanks to bootstrap sampling, we can refrain from keeping aside a portion of the original data set for testing purposes in advance. The reason is that each tree uses roughly two thirds of the observations, called the in-bag observations, whilst the remaining one third of the observations are never used to build that specific tree; these are called the out-of-bag (OOB) observations. For each tree, the out-of-bag observations act as a separate test set.
- To predict the response value for the n-th observation, one drops its input down every single tree for which this observation was out of bag. By doing so, one ends up with on average K/3 predictions for each observation n = 1, ..., N. To derive a single response prediction for the n-th observation, the average of these predictions is taken. The same procedure is repeated for all other observations. Whether these predictions are good enough needs to be evaluated based on certain metrics.

Performance metrics:
In effect, MSE_OOB is a sound approximation of the test error of the random forest, because every single data point is predicted based solely on the trees that were not constructed using that observation. In fact, when the number of trees K is very large, MSE_OOB is roughly equivalent to leave-one-out cross-validation.

Permutation importance:
The central idea of permutation importance is to measure the decrease in the prediction accuracy of the random forest that results from randomly permuting the values of a predictor. The method provides a ranking of the predictors' importance as its end result, and it is tied to a prediction performance measure. In particular, the permutation importance for predictor xp is derived as follows. For each of the K trees:
- first, record the prediction error MSE_OOB,k;
- second, noise up, i.e. permute (rearrange), the predictor xp in the out-of-bag sample for the k-th tree;
- third, drop this permuted out-of-bag sample down the k-th tree to get a new prediction error after the permutation, and calculate the difference between the two prediction errors (before and after the permutation).
In the end, average these differences over all trees to obtain the importance I_xp. The larger I_xp, the stronger the ability of xp to predict the response. Generally speaking, a positive permutation importance is associated with a decrease in prediction accuracy after permutation, whilst a negative permutation importance is interpreted as no decline in accuracy.
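Below is an illustrative sketch (not from the lecture) of a random forest with an OOB error estimate and permutation importance, using scikit-learn as an assumed tool. Note that scikit-learn's permutation_importance permutes features on a held-out set rather than on the per-tree OOB samples described above, so it is a close analogue of the procedure rather than an exact implementation; the simulated data and the suggested default hyperparameters are assumptions.

```python
# Random forest with OOB error and (held-out) permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n, p = 500, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.5, n)   # only x1, x2 matter (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,            # K = 500 trees (suggested default)
    max_features=p // 3,         # mtry = p/3 predictors tried at each split
    min_samples_leaf=5,          # nodesize = 5
    oob_score=True, random_state=0,
).fit(X_tr, y_tr)

mse_oob = np.mean((y_tr - rf.oob_prediction_) ** 2)      # OOB estimate of the test error
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

print("OOB MSE:", round(mse_oob, 3))
for j, imp in enumerate(perm.importances_mean):
    print(f"x{j + 1}: permutation importance = {imp:.3f}")
```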
Minimal depth:
The other approach for measuring predictor importance is based on a measure called minimal depth. The minimal depth shows how remote a node split on a specific predictor is with respect to the root node of a tree: here the position of a predictor in the k-th tree determines its importance for that tree. This means that, unlike permutation importance, the importance of each predictor is not tied to a prediction performance measure. Also, in addition to ranking variables, the method performs variable selection, a very useful feature for eliminating less important predictors.
The concept of minimal depth is formulated based on the notion of the maximal sub-tree for a feature xp, which is defined as the largest sub-tree whose root node is split using xp. In particular, the minimal depth of a predictor xp, a non-negative random variable, is the distance between the root node of the k-th tree and the closest maximal sub-tree for xp, i.e. the first order statistic of the maximal sub-trees. It takes values in {0, ..., Q(k)}, where Q(k), the depth of the k-th tree, reflects how distant the root is from the furthermost leaf node, i.e. the maximal depth. A small minimal depth value for predictor xp means that xp has high predictive power, whilst a large minimal depth value means the opposite. The root node is assigned minimal depth 0, and the successive nodes are numbered based on how close they are to the root. The minimal depth of each predictor is averaged over all trees in the forest.
The distribution of the minimal depth can be derived in closed form, and a threshold for picking meaningful variables can be computed, namely the mean of the minimal depth distribution. In particular, variables whose forest-aggregated minimal depth exceeds the mean minimal depth ceiling are considered irrelevant and could thus be excluded from the model. However, variable selection using the minimal depth threshold is more meaningful for problems with high dimensionality.

Summary:
- Decision trees are simple and interpretable, but on their own they often do not achieve high prediction accuracy.
- Bagging and random forests, among other tree-based algorithms (e.g. boosting), were developed to improve the prediction accuracy of trees.
- Random forests are among the state-of-the-art methods for regression and classification (supervised learning). However, their results can be difficult to interpret.

2.3 – Neural networks

Introduction to NNs

Neural networks (NNs) are computer systems that are often said to operate in a similar fashion to biological neural networks, such as the human brain. According to the universal approximation theorem, proved by Cybenko in 1989, an NN with one hidden layer, often called a shallow NN, and a nonlinear activation can approximate any function with arbitrarily small error, i.e. it can in principle learn anything. However, in such cases the hidden layer might need to be very large, and so it will not be useful for learning a model that generalizes well to new data. Instead, one prefers to construct multilayer deep NNs where each layer computes certain patterns which are then combined by the next layer in order to form more complex patterns, leading to more accurate predictions.

Neurons and layers:
To understand the term "deep" in deep learning, we first need to understand how ANNs are structured. In a network diagram, the circles are called neurons or nodes. Neurons drawn in the same column are organized into what we call layers. Layers between the input layer and the output layer are called hidden layers. If there is more than one hidden layer, the ANN is called a deep ANN. The layer structure used in these examples is called a dense layer, also known as a fully connected layer.
As the name suggests, a dense layer fully connects each input to each output.

Classes of neural networks

There are many classes of NNs, and many have sub-classes. The three most commonly used classes are feedforward neural networks, convolutional neural networks and recurrent neural networks.

Convolutional neural networks:
They are variants of multilayer perceptrons, designed to take in an input image, assign importance (weights) to various objects in the image and differentiate one object from another.

Recurrent neural networks:
They are types of ANNs that use sequential data or time series data. They are commonly used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition and image captioning, and they are incorporated into popular applications such as Siri, voice search and Google Translate. Like feedforward and convolutional neural networks (CNNs), recurrent neural networks use training data to learn.

Feedforward neural networks:
They are ANNs in which the connections between the units do not form a cycle; in feedforward NNs information always moves forward. The most basic feedforward NN, which contains only the input layer, is called a perceptron. A perceptron takes several binary inputs, for example x1, ..., x5, and produces a single binary output. There can also be multiple layers: multilayer perceptrons are feedforward NNs with three or more layers, including the input layer, the hidden layer(s) and the output layer. In the example network, each input has three dimensions, e.g. age, height and weight, and there are two neurons in the output layer, which means there are two possible outcomes, e.g. overweight or underweight. The number of neurons in the hidden layer is chosen arbitrarily.

Basic concepts

- Input layer: no computation is done; the information is passed to the next layer.
- Hidden layers: layers between the input and the output layers. They consist of a number of neurons or nodes, processing units that compute a linear combination of the inputs from the previous layer; this sum is then passed to an activation function, which performs some type of transformation on the given sum.
- Output layer: finally, an activation function that maps to the desired output format (e.g. softmax for classification, exponential for regression) is used.
- Connections and weights: the network consists of connections, each connection transferring the output of a neuron i to the input of a neuron j.
- Activation function: the activation function maps the node of the current layer to the node of the next layer. Nonlinear activation functions allow NNs to compute nontrivial problems using only a small number of nodes. Typical activation functions include, for example, the sigmoid, tanh and ReLU functions.

Each connection between two neurons has an associated weight wi,j, which represents the strength of the connection between the two neurons. Starting from the input layer, the input is passed to the next neuron via a connection and is multiplied by the weight. Next, for each node in the second layer, a weighted sum is computed over all incoming connections. This sum is then passed to an activation function f, which performs some type of transformation on the given sum; e.g. for the first neuron in the hidden layer, a1 = f(Σ_i wi,1 xi). Then we repeat the whole process from the hidden layer to the output layer. In general, if there are multiple hidden layers, we repeat the above process until we reach the output layer.
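The forward pass just described can be written out directly. This is an illustrative sketch (not from the lecture): the layer sizes, the random weights and the inclusion of bias terms are assumptions.

```python
# Forward pass through a small dense network with one hidden layer:
# weighted sums followed by an activation at each layer, softmax at the output.
import numpy as np

rng = np.random.default_rng(9)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input with three dimensions (e.g. age, height, weight), 4 hidden neurons, 2 outputs.
x  = np.array([0.3, 0.7, 0.1])
W1 = rng.normal(size=(4, 3)); b1 = np.zeros(4)   # weights/biases: input -> hidden
W2 = rng.normal(size=(2, 4)); b2 = np.zeros(2)   # weights/biases: hidden -> output

hidden = sigmoid(W1 @ x + b1)                    # weighted sums passed through the activation f
logits = W2 @ hidden + b2                        # weighted sums at the output layer
output = np.exp(logits) / np.exp(logits).sum()   # softmax for a 2-class output

print(hidden, output)
```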
Learning rule:
A learning rule is a rule, or an algorithm, which modifies the parameters of the neural network so that a given input to the network produces a favoured output. This learning process typically amounts to modifying the weights. Different learning rules for NNs include:
- Hebbian learning rule: it specifies how to modify the weights of the nodes of a network.
- Perceptron learning rule: the network starts its learning by assigning a random value to each weight.
- Delta learning rule: the modification of the synaptic weight of a node is equal to the product of the error and the input.
- Correlation learning rule: the correlation rule is a supervised learning rule.
- Outstar learning rule: it can be used when the nodes or neurons in a network are assumed to be arranged in a layer.

Neural network architecture: it involves the choice of the depth of the network, the number of hidden neurons in each hidden layer, and the activation function(s).

Hyperparameters: apart from the weights, you also have the so-called hyperparameters, whose values are used to control the learning process. For a simple feedforward neural network, the hyperparameters are:
- the number of neurons,
- the number of layers,
- the learning rate,
- the regularization penalty, to prevent overfitting,
- the momentum, which is added so that if the NN approaches a shallow local minimum it can still find a global minimum,
- the number of epochs, i.e. the number of iterations, which should be increased until the validation accuracy starts decreasing even while the training accuracy is still increasing (overfitting),
- the batch size, which, like a for loop, defines the number of samples to work through before updating the internal model parameters.

Training algorithms:
We train a model by minimizing a loss function, which is a way to quantify the difference between the actual values from the data and the predicted values from the model. For example, in a linear regression model, given the data pairs (xi, yi), i = 1, ..., n, we want to minimize
Σ_{i=1}^{n} (yi − β0 − β1 xi)².
In the previous examples, the loss function was the mean squared error (MSE). There are other types of loss functions, for example the cross-entropy loss used for classification.
We start the training process by setting arbitrary weights wℓij between every pair of neurons in consecutive layers ℓ; then we specify the activation functions f, choose a loss function L and define the minimization problem. Unlike computing the minimum, or maximum, of a basic function, e.g. x², there is usually no way to minimise L in one step; thus we need algorithms to train NNs. One of the most commonly used algorithms is stochastic gradient descent (SGD). During each iteration (epoch) r = 1, 2, ...:
- the loss function L(r) as well as the gradients dℓij(r) based on the current weights wℓij(r) are calculated;
- then all the weights are updated based on the gradients:
wℓij(r + 1) = wℓij(r) − α · dℓij(r);
- the process is repeated until some stopping criterion is reached, for example a maximum number of iterations rm, stopping when r > rm.
The whole procedure is usually called training; during each iteration we say the NN is learning by updating the weights wℓij. The hyperparameter α is called the learning rate. We have to test and tune it for each model before we know exactly where to set it; a typical range of values is [0.0001, 0.01]. A larger value of α can lead to overshooting, i.e. the updating steps α·dℓij are too large and may pass the minimum. By contrast, a smaller value of α requires more iterations before the loss function is minimized, i.e. it increases the computational time.
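The SGD update rule above can be illustrated on the linear-regression loss, where the gradients are easy to write down (a sketch, not from the lecture; the batch size, learning rate and iteration count are illustrative hyperparameters).

```python
# Mini-batch SGD for the MSE loss of a simple linear model:
# at each iteration, compute the gradient on a mini-batch and take a step of size alpha.
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 200)      # data from a simple linear model

beta0, beta1 = 0.0, 0.0                          # arbitrary starting weights
alpha, batch_size, max_iter = 0.01, 32, 2000     # learning rate, batch size, r_m

for r in range(max_iter):
    idx = rng.choice(len(x), size=batch_size, replace=False)   # mini-batch
    xb, yb = x[idx], y[idx]
    resid = yb - (beta0 + beta1 * xb)            # y - y_hat on the batch
    grad0 = -2.0 * resid.mean()                  # d loss / d beta0
    grad1 = -2.0 * (resid * xb).mean()           # d loss / d beta1
    beta0 -= alpha * grad0                       # w(r+1) = w(r) - alpha * gradient
    beta1 -= alpha * grad1

print(beta0, beta1)                              # should approach 1.0 and 2.0
```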
Overall, the act of choosing between a higher and a lower learning rate leaves us with this kind of trade-off.

Training, validation and testing:
We split the data into three distinct data sets.
- The training set is used to train the model, i.e. to minimise the loss function and find the optimized weights wℓij.
- After the training step, the model produces output based on the inputs in the validation set; the weights are not updated in this step. The output on the validation set is then used to adjust the learning parameters and to check whether the model is overfitting. If the predictions on the validation set perform considerably worse than on the training set, the model is said to be overfitting. The concept of overfitting boils down to the fact that the model has learned the features of the training set extremely well, but if we give the model any data that deviates slightly from the exact data used during training, the model is unable to generalize and accurately predict the output.
There are several common ways to reduce overfitting:
- Add more data to the training set.
- Data augmentation: create additional augmented data by reasonably modifying the data in the training set (e.g. rotating, flipping or zooming an image data set).
- Reduce model complexity (e.g. reduce the number of hidden layers, or the number of neurons in some layers).
- Dropout: randomly ignore a proportion of the nodes in a given layer during training.
There is also the problem of underfitting, when the model is not even able to predict the data it was trained on, which we may need to deal with. There are several ways to reduce underfitting:
- Increase the complexity of the model.
- Add more features to the input samples.
- Reduce the proportion of dropout.
Once we encounter overfitting or underfitting, we need to try the measures suggested above and retrain the model.
After training and validating the model, the final step is testing the model. The test set provides a final check that the model generalizes well before it is deployed to production. Note that, as a rule of thumb, a neural network with 3 to 5 hidden layers can solve most high-dimensional problems. The prediction performance on the test set is the final assessment of the model before deploying it to production. No matter how good or bad the performance on the test set, one cannot go back to adjust the structure of the model and retrain it; that should be done during training and validation.
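To close, here is an illustrative sketch of the training / validation / test workflow (not from the lecture): scikit-learn's MLPRegressor stands in for a small feedforward NN, the data are simulated, and the 60/20/20 split, hyperparameters and the simple overfitting check are assumptions.

```python
# Train / validation / test workflow with a small feedforward network.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.uniform(-2, 2, size=(1000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 1000)

# Split into ~60% training, 20% validation, 20% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(32, 32),   # two hidden layers
                     alpha=1e-4,                    # L2 regularization penalty
                     learning_rate_init=0.01,       # learning rate
                     max_iter=2000, random_state=0).fit(X_tr, y_tr)

def mse(a, b):
    return np.mean((a - b) ** 2)

mse_tr, mse_val = mse(y_tr, model.predict(X_tr)), mse(y_val, model.predict(X_val))
print("train MSE:", round(mse_tr, 4), "validation MSE:", round(mse_val, 4))
if mse_val > 2 * mse_tr:                            # crude overfitting check (illustrative threshold)
    print("validation error much larger than training error -> likely overfitting")

# Final, one-off assessment on the test set.
print("test MSE:", round(mse(y_te, model.predict(X_te)), 4))
```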