Machine Learning in Finance: Decision Trees
Document Details
University of St. Gallen
Despoina Makariou
Summary
This presentation provides an overview of decision trees, focusing on their application in finance. The lecturer explains how these models can be used to structure complex financial decisions, highlighting their ease of interpretation compared to linear models.
Full Transcript
Machine Learning in Finance: Trees
Prof. Dr. Despoina Makariou, Institute of Insurance Economics, University of St. Gallen

Learning outcomes
- Regression trees
- Classification trees

A Simple Financial Decision-Making Scenario
Imagine you're a portfolio manager tasked with deciding whether to invest in a particular stock. What factors would you consider in such a financial decision? You have various factors at your disposal: the company's profitability, its recent performance, economic trends, interest rates, and perhaps even global news. How do you decide which factors are most important to focus on, and what combination of them should lead to your final decision? Decision-making in finance is not always straightforward and often involves complex, data-driven decisions.

How can a decision tree help?
A decision tree is like a flowchart for decision-making. It helps you break down complex decisions into smaller, more manageable choices. Each 'branch' of the tree represents a question or a condition, like 'Are the stock's quarterly earnings growing? Yes or No?' Depending on your answer, you move to the next branch, asking another question until you reach a decision: buy, sell, or hold. Decision trees mimic human decision-making, but in a structured, data-driven way.

A financial decision tree (figure)

Some Advantages and Limitations
Advantages:
- Easy to understand and interpret.
- Non-linear relationships: can capture more complex patterns than linear models.
- Useful for feature importance: identifying which factors (features) matter most in making decisions.
Limitations:
- Overfitting: decision trees can get too specific to the training data (low bias).
- Instability: small changes in the data can result in large changes in the tree structure (high variance).

Importance
Why are decision trees important in machine learning? Decision trees are foundational models for more advanced machine learning techniques, which combine multiple decision trees to improve accuracy and performance.

Decision trees, more formally...

Tree-based methods
We focus on tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarised in a tree, these types of approaches are known as decision tree methods.

Baseball player summary data (figure)
Salary is coloured from low (blue, green) to high (yellow, red). How would you stratify it?
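To make the stratification question concrete, here is a minimal sketch of fitting a small regression tree with scikit-learn. The data are synthetic stand-ins for the Years, Hits and (log) Salary variables described above, not the actual Hitters dataset, and the use of scikit-learn's DecisionTreeRegressor is an illustrative choice rather than the tool used in the lecture.

```python
# Minimal sketch: a three-leaf regression tree on synthetic Hitters-style data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 263
years = rng.integers(1, 20, size=n)      # years in the major leagues (synthetic)
hits = rng.integers(40, 200, size=n)     # hits in the previous season (synthetic)
# Salary built so that it depends mainly on experience, and on hits only
# for experienced players (a toy version of the pattern in the slides).
salary = 200 + 60 * np.minimum(years, 6) + 2 * hits * (years >= 5) \
         + rng.normal(0, 50, size=n)

X = np.column_stack([years, hits])
y = np.log(salary)                       # salaries are typically log-transformed

# Restrict the tree to three leaves to mirror a three-region partition.
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))
```

With a three-leaf tree, the printed rules will typically show a first split on Years and, within the more experienced group, a further split on Hits, which is exactly the kind of partition discussed next.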
Decision tree for the data
R1 = {Years < 4.5}, R2 = {Years ≥ 4.5 and Hits < 117.5} and R3 = {Years ≥ 4.5 and Hits ≥ 117.5}.

Details for the decision tree
In keeping with the tree analogy, the regions R1, R2 and R3 are known as terminal nodes or leaves of the tree. Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5. We refer to the segments of the tree that connect the nodes as branches.

Interpretation of the decision tree results
Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary. But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary, and players who made more Hits last year tend to have higher salaries. The figure is likely an over-simplification, but compared to a regression model it has advantages in terms of display and interpretation.

Details in the tree building process
We divide the predictor/covariate space, i.e. the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ. For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj. In theory, the regions could have any shape. Decision trees divide the predictor space into axis-aligned high-dimensional rectangles for simplicity and interpretability. Our target is to find rectangles R1, R2, ..., RJ that minimise the RSS

    \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2    (1)

where \hat{y}_{R_j} = N_j^{-1} \sum_{i : x_i \in R_j} y_i and N_j = \#\{ i : x_i \in R_j \}.

More details in the tree building process
It is computationally infeasible to consider every possible partition of the feature space into J rectangles. We take a top-down, greedy approach known as recursive binary splitting.
- Top-down: it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.
- Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
We first select the predictor Xj and cutpoint s such that splitting the predictor space into the regions {X : Xj < s} and {X : Xj ≥ s} leads to the greatest possible reduction in RSS. Then we repeat the process, searching for the best predictor and cutpoint to further split the data so as to minimise the RSS within each of the resulting regions. Note that rather than splitting the entire space, we split one of the two previously identified regions, so that we now have three regions. Again, we look to split one of these three regions further, so as to minimise the RSS. The process continues until a stopping criterion is reached, e.g. until no region contains more than five observations.
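As an illustration of a single greedy step of recursive binary splitting, the sketch below scans every predictor and every candidate cutpoint and keeps the pair that minimises the combined RSS of the two resulting regions, as in equation (1). The function and variable names are illustrative and not taken from any particular library.

```python
# One greedy binary split: pick the predictor j and cutpoint s that
# minimise RSS(left region) + RSS(right region).
import numpy as np

def rss(y):
    """Residual sum of squares around the region mean (0 for an empty region)."""
    return 0.0 if y.size == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    """Return (best_j, best_s, best_rss) for one binary split of (X, y)."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        # Midpoints between consecutive sorted values are the candidate cutpoints.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:
            left, right = X[:, j] < s, X[:, j] >= s
            total = rss(y[left]) + rss(y[right])
            if total < best[2]:
                best = (j, s, total)
    return best

# Tiny illustration with two predictors.
X = np.array([[1.0, 5.0], [2.0, 3.0], [6.0, 8.0], [7.0, 2.0]])
y = np.array([1.0, 1.2, 3.0, 3.1])
print(best_split(X, y))   # expect a split on predictor 0 at cutpoint 4.0
```

Growing a full tree simply repeats this search inside each region produced by the previous split, until the stopping criterion is reached.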
Splitting the predictor space
Generally we create the partitions by iteratively splitting one of the X variables into two regions. For instance:
1. First split on X1 = t1.
2. If X1 < t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.

Estimator for regression trees
Consider the statistical learning model

    Y_i = f(X_i) + \epsilon_i,  i = 1, ..., n    (2)

where X_i = (x_{i1}, x_{i2}, ..., x_{ip})^T. A regression tree T with leaf regions R1, ..., RJ corresponds to the estimator \hat{f}_T such that, for x \in \mathbb{R}^p,

    \hat{f}_T(x) := \sum_{j=1}^{J} \bar{Y}_{R_j} 1\{x \in R_j\}    (3)

where \bar{Y}_{R_j} := N_j^{-1} \sum_{i : X_i \in R_j} Y_i and N_j = \sum_{i=1}^{n} 1\{X_i \in R_j\}. The training RSS of the regression tree T is

    RSS(T) := \sum_{j=1}^{J} \sum_{i : X_i \in R_j} (Y_i - \bar{Y}_{R_j})^2    (4)

Pruning a tree
The tree building process may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance. We can first grow a very big tree T_0 and then prune it back to obtain a subtree. In cost complexity pruning, we consider a sequence of trees indexed by a tuning parameter \alpha > 0. For each \alpha, we seek a subtree T_\alpha \subset T_0 such that

    T_\alpha = \arg\min_{T \subset T_0} \{ RSS(T) + \alpha |T| \}    (5)

where |T| denotes the number of leaves of T.

Choosing the optimal subtree
\alpha is the tuning parameter that controls the trade-off between the subtree's complexity (variance) and its fit to the training data (bias). The optimal \alpha can be obtained using cross-validation. We then return to the full dataset and obtain the subtree corresponding to the optimal \alpha.

Algorithm for building a regression tree
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \alpha.
3. Use K-fold cross-validation to choose \alpha. That is, divide the training observations into K folds. For each k = 1, ..., K:
   a. Repeat Steps 1 and 2 on all but the k-th fold of the training data.
   b. Evaluate the mean squared prediction error on the data in the left-out k-th fold, as a function of \alpha.
   c. Average the results for each value of \alpha and pick \alpha to minimise the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of \alpha.

Baseball data continued
We randomly split the data into 132 training observations and 131 test observations. We then build a large regression tree on the training data and vary \alpha to create subtrees with different numbers of terminal nodes. Finally, six-fold cross-validation is implemented to estimate the cross-validated MSE of the trees as a function of \alpha.

Baseball data regression tree (figure)

Cross validation (figure)
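The following sketch mirrors Steps 1 to 4 of the algorithm above using scikit-learn, whose minimal cost-complexity pruning (the ccp_alpha parameter) plays the role of α in equation (5). The data here are synthetic placeholders rather than the baseball data, so the numbers printed will not reproduce the lecture's figures.

```python
# Cost complexity pruning with cross-validation, sketched with scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(263, 2))
y = X[:, 0] + 0.5 * (X[:, 1] > 0) + rng.normal(scale=0.3, size=263)

# Step 1: grow a large tree; Step 2: get the candidate alphas along the pruning path.
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Step 3: six-fold cross-validation of the MSE as a function of alpha.
cv = KFold(n_splits=6, shuffle=True, random_state=0)
cv_mse = [
    -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                     X, y, cv=cv, scoring="neg_mean_squared_error").mean()
    for a in alphas
]

# Step 4: refit on all the data with the alpha that minimises the CV error.
best_alpha = alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```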
Classification trees
For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. We use recursive binary splitting to grow a classification tree. Recall that in a regression tree we defined the cost complexity criterion

    C_\alpha(T) = RSS(T) + \alpha|T| = \sum_{j=1}^{|T|} N_j Q_j(T) + \alpha|T|    (6)

where Q_j^{MSE}(T) = N_j^{-1} \sum_{i : X_i \in R_j} (Y_i - \bar{Y}_{R_j})^2 is the MSE of the regression in the j-th terminal node (region).

Under classification setups, we need to substitute the loss Q_j^{MSE} by some alternative measure. Misclassification error rate: the fraction of training observations in the region that do not belong to the most common class,

    Q_j^{MER}(T) = N_j^{-1} \sum_{i : X_i \in R_j} 1\{Y_i \neq \hat{\ell}_j\} = 1 - \hat{p}_{j, \hat{\ell}_j}    (7)

where \hat{\ell}_j = \arg\max_{\ell} \hat{p}_{j,\ell} and \hat{p}_{j,\ell} = N_j^{-1} \sum_{i : X_i \in R_j} 1\{Y_i = \ell\} represents the proportion of training observations in the j-th leaf region that are from the \ell-th class.

Two other measures of loss
However, the classification error is not sufficiently sensitive for tree growing, and in practice two other measures are preferable.
Gini index:

    Q_j^{Gini}(T) = \sum_{\ell=1}^{L} \hat{p}_{j,\ell} (1 - \hat{p}_{j,\ell})    (8)

a measure of total variance across the L classes. The Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
Cross-entropy (deviance):

    Q_j^{xent}(T) = - \sum_{\ell=1}^{L} \hat{p}_{j,\ell} \log \hat{p}_{j,\ell}    (9)

These two measures are used to evaluate the quality of a particular split, since they are more sensitive to node purity than the classification error rate.

Comparison among the three measures
For two classes (L = 2), the three measures are 1 − max(p, 1 − p), 2p(1 − p) and −p log(p) − (1 − p) log(1 − p), respectively, where p denotes the proportion of observations in the node belonging to the first class (a small numerical sketch is given at the end of this section). The cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimisation.

Advantages and disadvantages of trees
- Trees are easy to explain to people; they are even easier to explain than linear regression.
- Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous lectures.
- Trees can be displayed graphically and are easily interpreted even by a non-expert.
- Trees can easily handle qualitative predictors without the need to create dummy variables.
- However, trees do not have the same level of predictive accuracy as some of the other regression and classification approaches.
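For concreteness, here is the small numerical sketch of the three impurity measures referred to in the comparison above, written out for a two-class node. The function names are illustrative.

```python
# The three node impurity measures for a two-class node (L = 2),
# as functions of p, the proportion of the node in the first class.
import numpy as np

def misclassification(p):
    return 1.0 - max(p, 1.0 - p)

def gini(p):
    return 2.0 * p * (1.0 - p)

def cross_entropy(p):
    if p in (0.0, 1.0):          # define 0*log(0) = 0 for a pure node
        return 0.0
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

for p in (0.5, 0.75, 0.9, 1.0):
    print(f"p={p:.2f}  MER={misclassification(p):.3f}  "
          f"Gini={gini(p):.3f}  xent={cross_entropy(p):.3f}")
# All three are largest at p = 0.5 (an impure node) and shrink to 0 as the
# node becomes pure; Gini and cross-entropy decrease more smoothly, which
# is why they are preferred when growing the tree.
```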