Summary

This study guide covers modeling in finance using Excel and Python. It includes chapters on Excel, Python, financial data, databases and SQL, probability distributions, statistical tests, simulation, optimization, regression, classification, and agile development. The guide is intended for students of finance.

Full Transcript


Modeling in Finance using Excel and Python
Ben Van Vliet
Version 7.2
September 17, 2024

Contents

CHAPTER 1 INTRODUCTION TO EXCEL
  1.1 Getting Started with Excel
  1.2 Spreadsheet Quality and Testing
    1.2.1 Test-driven Development
    1.2.2 Spreadsheet Testing Methods
  1.3 Excel Syntax
    1.3.1 Working with Cells
    1.3.2 Working with Data
    1.3.3 Working with Dates and Times
    1.3.4 Other Functionality
    1.3.5 Getting Price Data from Yahoo! Finance
    1.3.6 Histograms in Excel
  1.4 Descriptive Statistics
  LAB 1.1: Descriptive Statistics of Stock Returns
CHAPTER 2 INTRODUCTION TO PYTHON
  2.1 Getting Started with Python
  2.2 Memory Management
    2.1.1 Simple Syntax Example
    2.1.2 Control Structures Code Example
  2.2 Python Data Structures
    2.2.1 Lists [ ]
    2.2.2 Numpy Arrays
    2.2.3 Dictionaries { }
    2.2.4 Tuples
    2.2.5 Pandas Dataframe
  2.3 Python Data Sources
    2.3.1 Reading .csv and .xlsx Files
  LAB 2.2: Descriptive Statistics in Python
  LAB 2.4: Histograms in Python
    2.3.2 Writing and Reading Numpy Files
    2.3.3 Working with Pickle Files
    2.3.4 Working with Pandas Datareader
    2.6.5 Installing a New Package
CHAPTER 3 FINANCIAL DATA
  3.1 Types of Databases in Finance
  3.2 Data Dictionary
  3.3 Data Cleaning
    3.3.1 Identify Required Cleaning Activities and Algorithms
    3.3.2 Bad Data and Outliers
  LAB 3.1: Winsorizing in Practice
    3.3.3 The Point-in-Time Data Problem
    3.3.4 Synchronizing Data
  3.4 Rescaling Data
    3.4.1 Normalizing
    3.4.2 Standardizing Data
    3.4.3 Difference Bins
  3.5 Ranking in Practice
  3.6 Double Standardizing
  LAB 3.2: Ranking and Standardizing
  3.7 Data Cleaning Checklist
CHAPTER 4 DATABASES and SQL
  4.1 Database Management Systems
  LAB 4.1: Working with SQLite Databases
  LAB 4.2: Working with the Finance.sqlite Database
  LAB 4.3: Working with the Options.sqlite Database
  4.2 Structured Query Language
    4.2.1 Data Manipulation Language
      SELECT
      WHERE Clause
      ORDER BY
      LIKE
      Aggregate SQL Functions
      DISTINCT
      GROUP BY
      HAVING
      Mathematical Operations
      CASE
      Aliasing
      Joining Tables
      UNION
      INNER JOIN
      OUTER JOIN
      INSERT
      UPDATE
      DELETE
    4.2.2 Data Definition Language
      Creating Views
      CREATE, ALTER, and DROP
  LAB 4.4: Find the Average Log Return of IBM
CHAPTER 5 PROBABILITY DISTRIBUTIONS
  5.1 Continuous Distributions
    5.1.1 Uniform Distribution
  LAB 5.1: Uniformly Distributed Random Numbers
    5.1.2 Exponential Distribution
  LAB 5.2: Inter-Trade Times
    5.1.3 Triangular Distribution
  LAB 5.3: Profit Estimates
    5.1.4 Normal Distribution
  LAB 5.4: Profit Estimates
    5.1.5 Lognormal Distribution
    5.1.6 Chi-squared ( χ2 ) Distribution
    5.1.7 Non-central Student t Distribution
  5.2 Discrete Distributions
    5.2.1 Bernoulli Trials
    5.2.2 Binomial Distribution
    5.2.3 Central Limit Theorem
  LAB 5.5: Example of the CLT at Work
    5.2.4 Poisson Distribution
  LAB 5.6: Inter-Trade Times and the Number of Trades per Unit of Time
    5.2.5 Empirical Distributions
  LAB 5.7: Trade Data
CHAPTER 6 STATISTICAL TESTS
  6.1 t-Test of the Mean
  6.2 χ2-Test
  6.3 Shapiro-Wilk Test
  6.4 Anderson-Darling Test
  6.5 Kruskal-Wallis Test
  6.6 Kolmogorov-Smirnov Test
  6.7 Wald-Wolfowitz Runs Test
  6.8 Variance Ratio Test
  6.9 Pearson Correlation Test
  6.10 Spearman Rank Correlation Test
  6.11 Testing the Significance of the Correlation
  6.12 Durbin-Watson Test
CHAPTER 7 SIMULATION 1: Generating Random Numbers
  7.1 Inverse Transform Method
  7.2 Linear Congruential Generator
  LAB 7.1: Generate Standard Uniform Variates Using LCG
  7.3 Generating Random Variates from Various Distributions
  LAB 7.2: Generating Random t-Stats
  7.4 Generating Correlated Normal Random Numbers
  7.5 Accept-Reject Method
CHAPTER 8 SIMULATION 2: Generating Price Paths
  8.1 Stochastic Processes
  8.2 Geometric Brownian Motion
    8.2.1 Stochastic Differential Equation to Solution Using Ito's Lemma
    8.2.3 Simulating Stock Prices
  8.3 Simulating Short Rates
CHAPTER 9 SIMULATION 3: Options and Value at Risk
  9.1 Options
  9.2 Value at Risk
    9.2.1 Risk Management
CHAPTER 10 SIMULATION 4: Stochastic Volatility
  10.1 Fat Tails
  10.2 Stochastic Volatility Models
    10.2.1 The Heston Model
    10.2.2 ARCH(1)
    10.2.3 GARCH(1,1)
    10.2.4 HAR Model
  10.3 Estimating Volatility
    10.3.1 Close-to-Close Volatility
    10.3.2 Extreme Value Volatility Estimators
  10.4 Adding Past and Future Volatilities
CHAPTER 11 OPTIMIZATION 1: Simplex, GRG, and Evolutionary
  11.1 Simplex Algorithm
  LAB 11.1: Linear Simplex Algorithm
  11.2 Gradient Maximization
  LAB 11.2: GRG Nonlinear Algorithm
  11.3 Evolutionary Algorithms
  LAB 11.3: Evolutionary Algorithm
  11.4 Optimization Outcomes
CHAPTER 12 OPTIMIZATION 2: Optimizing Portfolios
  12.1 Bond Portfolio Optimization
  12.2 Equity Portfolio Optimization
  12.3 The Sharpe Ratio
  LAB 12.4: Capital Budgeting Example
CHAPTER 13 OPTIMIZATION 3: Model Calibration
  13.1 Minimizing the Sum of Squared Errors (SSE)
  13.2 Maximum Likelihood Estimation
  LAB 13.1: MLE with a Poisson Distribution
  LAB 13.1: Optimizing ARCH and GARCH Parameters
  LAB 13.2: Nelson-Siegel Model of the Yield Curve
  LAB 13.3: Seasonality and Holt-Winters Model
CHAPTER 14 REGRESSION
  LAB 14.1: Generating Random Data
  LAB 14.2: Simple Linear Regression
  14.1 Capital Asset Pricing Model
  LAB 14.3: Calculating CAPM Beta
  LAB 14.4: The Fama-French 3-Factor Model
  14.3 Fama-MacBeth Regression
  14.4 Lasso, Ridge, and Elastic Net Regression
CHAPTER 15 CLASSIFICATION
  15.2 Logistic Regression
  15.3 Performance Evaluation
    15.3.1 Test-Train Split
    15.3.2 Confusion Matrix
  LAB 15.2: Logistic Regression Example
CHAPTER 16 AGILE DEVELOPMENT
  16.1 Iterations
  16.2 Teams
  16.3 Tools
    16.3.1 Continuous Integration
    16.3.2 Automated Unit Testing
    16.3.4 Acceptance Testing
    16.3.6 Test Driven Development
    16.3.7 Project Velocity
    16.3.8 Design Patterns
    16.3.9 Domain Driven Design
    16.3.10 Code Refactoring
  16.4 Extreme Programming
  16.5 Scrum
    16.5.1 Scrum Management
    16.5.2 Scrum Concepts
  16.6 Conclusion

CHAPTER 1 INTRODUCTION TO EXCEL

Please be aware that the financial industry will expect that you know everything in this book. You will need to study the material for the closed-book weekly quizzes and the exams. These will not require you to memorize code. (Coding is for the homeworks.) Rather, they will require you to articulate concepts and do math. However, coding leads to understanding.

Finance is the study of how people make investment decisions. These decisions are always forward looking and subject to considerable uncertainty. We only care about the past if it can help us make better decisions about the future.

A model is a representation of reality. We usually think of models as mathematical equations, though not all equations are models. Graphs and pictures, algorithms and code, data tables, and words can also be models. This book combines finance with mathematical models and Python code. Because models are representations, particularly of social phenomena like finance, we recognize that:

"All models are wrong. Some models are useful." - George Box

A big part of the problem with modeling in finance is that finance is not a natural science:

1. We cannot run controlled experiments to test theories scientifically.
2. We cannot have complete (or any?) faith in statistics in finance because the distributions are not stationary. The means and variances (a measure of uncertainty) change over time.

Nevertheless, models help us understand observed, or empirical, phenomena and potentially help us predict the future. Models also help us communicate ideas about how we think things work. So, models are a kind of vocabulary. Demonstrating your knowledge of finance depends upon communicating, or articulating, clearly how you think things work. If you are unable to articulate your ideas, people will think you don't know what you're talking about. So, you will need to become adept at five methods of articulation: equations, graphs, code, data tables, and words. Notice that all of these can articulate the same idea.

To be sure, getting good at all of these is hard. And, you may have little experience. But don't let that stop you from getting started. Don't worry about what you don't know. Keep trying. Practice, practice, practice. You were smart enough to get into Illinois Tech. You're smart enough to figure it out. And remember, the answer to every question is "I don't know. Google it." Alternatively, the right answer is often, "I don't know. Try it out and see what happens." This is a sophisticated learning technique known as hacking. We are all hackers!

1.1 Getting Started with Excel

An Excel spreadsheet application contains a workbook, which is a collection of worksheets. Spreadsheets are software, which makes you a software developer. Mistakes, or bugs, can be very costly. They can even get you fired. BE CAREFUL!
Spreadsheet errors are pervasive. As Dr. Ray Panko points out, "consistent with research on human error rates in other cognitive tasks, laboratory studies and field examinations of real-world spreadsheets have confirmed that developers make uncorrected errors in 2%-5% of all formulas… Consequently, nearly all large spreadsheets are wrong and in fact have multiple errors… Spreadsheet error rates are very similar to those in traditional programming." [Panko, Raymond R. 2006. "Recommended Practices for Spreadsheet Testing." EuSpRIG 2006 Conference Proceedings. p. 73 ff.]

Be sure to test your spreadsheet calculations. Testing should consume 25%-40% of development time. You should be careful when using any Excel function, as the formula used for the calculation may be different from what you expect. Never assume that a function calculates something the way you think it does. Always verify the formula using the help files.

READ THIS: Spreadsheet Error Horror Stories

1.2 Spreadsheet Quality and Testing

The outcome of a modeling project can be:

a.) a working piece of software deployed to drive business decisions;
b.) a prototype, which is a temporary working implementation to be replaced; or
c.) a requirements specification for software development in, for example, Python.

In these cases, prototyping in Excel provides for:

1. Clear definitions of algorithms. Building a prototype forces you to clearly define and formulate calculations. If you cannot explain the system well enough to build a prototype, then there is little chance that the developers will be able to build it.
2. Clear definition of data requirements. Prototypes bring into focus what data inputs are needed for implementation and what outputs a working model will produce.
3. Requirements for graphical user interfaces. This forces discussion about how humans will interact with the software. The purpose of user interface (UI) prototyping is to ensure that the end user can change dynamic inputs and states, shut down, turn on/off, and select parameters.
4. A working application for regression testing. You will need your Excel prototype to test your Python implementation.

The primary reason for creating an Excel prototype is to resolve uncertainties early in the modeling process: is it feasible? Can it be built? An Excel prototype is useful for revealing and resolving ambiguity and incompleteness in the requirements. The presence of unclear requirements may create an expectation gap between management's vision and your vision. Prototyping makes requirements tangible by bringing use-cases to life and closes gaps in understanding of requirements. (Financial engineers may be the only ones able to understand the equations and symbols. Prototyping helps explain complex calculations to less quantitative team members.)

Here are some tips toward effective prototyping:

- Include internal or external documentation, consisting of:
  o A brief description of the model and who built it.
  o The version number.
  o Details about the input data used in the model.
  o A summary of model change requests, i.e. a model change control document.
  o A summary of logical assumptions.
- For all members of your team, use the same style conventions, e.g. font, color coding, size, etc.
- Focus initial prototyping efforts on poorly understood or risky areas. That is, prototype the hardest parts first. That way, if you can't get it to work, you don't waste time on the easy stuff.
Remember that an ounce of prevention is worth a pound of cure. Reworking defective models can waste a lot of time. As a rule of thumb, time spent on prevention pays off (at least) 3-to-1 versus time spent on fixing defects. You can prevent errors by using spreadsheet design best practices. The six golden rules of spreadsheet design are:

1. Separate inputs, calculations, and results. Place inputs in a separate part of the worksheet from calculations. Outputs should be in another part still.
2. Use one formula per row or column.
3. Refer to the above and to the left.
4. Use multiple worksheets for ease of expansion and for repeatable blocks.
5. Use each column for the same purpose throughout the model.
6. Include a documentation sheet that identifies key formulae.

Grossman and Ozluk add the following:

1. Keep formulas simple to enable checking of intermediate calculations.
2. Use named cells and ranges.
3. Enter each input once.
4. Use absolute references.
5. Use cell protection to prevent accidental modification of formulas.
6. Make the spreadsheet file read-only to prevent accidental overwriting.

Also, we recommend enhancing the look of the spreadsheet with bold fonts, borders, shading, and graphs when displaying results. This will make the spreadsheet easier to read.

If your Excel spreadsheet is deployed, you will want to prevent users from introducing errors. I recommend that at a minimum you employ Tom Grossman's six features:

1. Protect source code from user changes.
2. Prevent user customization.
3. Restrict linking user spreadsheets into the application.
4. Prevent users from sharing the application.
5. Provide easy control of protection features using a well-designed user interface.
6. Prevent user tampering with protection features by employing passwords.

Following these guidelines will at least make it very difficult for users to mess around with the code and introduce errors or untested modifications that you might otherwise get blamed for.

1.2.1 Test-driven Development

Test-driven development (TDD) is a software development methodology that relies on the repetition of, or iteration over, a three-stage cycle:

1. Create test cases that define desired functionalities.
2. Write the minimum amount of code that passes the tests.
3. Refactor the code to acceptable standards, if necessary.

That is, before you start your modeling project, you have to be able to answer the question: "How am I going to test this software?"
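To make the cycle concrete, here is a minimal sketch of one TDD iteration in Python. The present_value() function, its test, and the numeric tolerance are hypothetical illustrations, not code from this book: the test is written first, then just enough code is written to pass it.

# A hypothetical TDD iteration: the test comes first, then the code.

# Stage 1: a test case that defines the desired functionality.
# $100 received in 0.25 years, discounted at 8% per year.
def test_present_value():
    assert abs( present_value( 100, 0.08, 0.25 ) - 98.0944 ) < 0.001

# Stage 2: the minimum amount of code that passes the test.
def present_value( cash_flow, rate, time ):
    return cash_flow / ( 1 + rate ) ** time

# Stage 3 would refactor as needed, re-running the test after every change.
test_present_value()
print( "All tests passed." )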
1.2.2 Spreadsheet Testing Methods

In spreadsheet testing, you try several input values to see if they produce the correct results. How do you know your spreadsheet is correct? This is the oracle problem. An oracle is a way to determine whether a piece of software passes a test. Given a test case (i.e. a set of inputs), we compare the output of our software to the output of the oracle. There are three ways to solve the oracle problem:

1. Find an oracle, a credible example, either in a book or a research paper or a website, that you can replicate in Excel. If your Excel code generates the same results as the oracle, you can be fairly confident your Excel code is correct.
2. If no oracle exists, then create, audit, and test an Excel spreadsheet implementation in which you have a high degree of confidence.
3. If no oracle exists, generate simulated data using known parameters, then develop your model and run it against the data. If the model returns the same parameters known to be in the data, you can be fairly confident your code is correct.

In the next chapter, we will begin to develop Python code. How do you know your Python code is correct? If your Python code generates the same results as your Excel spreadsheet (which generates the same results as your oracle), you can be fairly confident your Python code is correct.

Regression testing is what you do after you make a change to a spreadsheet or Python code to ensure that it works the same as before. You use a suite of input values previously used in the model and compare against the values from before the software was changed. The previous version of the spreadsheet is the oracle.

In spreadsheet inspection, you examine all formula cells. Logic inspection before you convert your model to Python allows errors to be detected while the cost of fixing them is small. If you're working on a team, team inspection can be used, where you check each other's work.

As said, testing, in its various forms, should consume 25%-40% of your model development time. But realize that all these preventative measures and testing methods can never be assumed to eliminate all errors. That is why I encourage iterative development. The more iterations, the more opportunities for uncovering and removing errors.

1.3 Excel Syntax

There are many, many built-in functions in Excel. If you are unfamiliar with them, click on the fx icon to peruse all the functions. You do not have to memorize anything. Some of the Excel functions are related to financial topics, for example, IRR(), DURATION(), NPV(), PMT(), and ACCRINT(). I'm not really interested in these. Here is a brief list of built-in Excel functions you should familiarize yourself with:

AVERAGE, COVAR, EXP, LOG, NORM.S.DIST, SUM, CORREL, DATE, IF, MAX/MIN, RAND, TRANSPOSE, SQRT, DAYS360, INTERCEPT, MINVERSE, SLOPE, VAR.S, COUPDAYBS, DURATION, LINEST, MMULT, STDEV.S, VLOOKUP

You should look at the descriptions and help files of these functions, as well as the many others. You will not be able to memorize all the Excel functions, so the key is to know what kinds of functions are available in Excel and how to find them quickly.

To use a function, simply type = into a cell, then the function name, then the parameters, or input arguments. For example:

= EXP( 5 )

The return value of the function will appear in the cell. Some functions require an array of data as a parameter. This is accomplished using ranges. Usually we set these parameters by pointing and clicking on a cell, rather than by typing the code. For example, given values in cells A1 through A5, the sum of these values can be calculated in cell B1 as:

= SUM( A1:A5 )

Some functions accept more than one parameter, in which case the parameters are separated by commas. If a cell used as an input argument is in another worksheet, the cell reference is preceded by the sheet name. For example:

= IF( Sheet1!A1 > 0, 1, 0 )

Matrix functions often return arrays. To make this work, highlight the entire output range, type in the function, and press Control-Shift-Enter. For example, given a matrix of values in cells A1 to B2, we can put the transposed matrix into cells C1 to D2 by highlighting C1 to D2, entering the formula, and then pressing Control-Shift-Enter:

= TRANSPOSE( A1:B2 )

Other matrix functions include MMULT, MINVERSE, and MDETERM.
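Because this book's workflow regression-tests Python code against the Excel prototype, it helps to know the numpy counterparts of these matrix functions. The following is a minimal sketch, not code from the text; the matrix values are arbitrary.

import numpy as np

# The same 2x2 matrix we might hold in Excel cells A1:B2
a = np.array( [ [ 1.0, 2.0 ], [ 3.0, 4.0 ] ] )

print( a.T )                 # TRANSPOSE( A1:B2 )
print( np.dot( a, a ) )      # MMULT( A1:B2, A1:B2 )
print( np.linalg.inv( a ) )  # MINVERSE( A1:B2 )
print( np.linalg.det( a ) )  # MDETERM( A1:B2 )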
1.3.1 Working with Cells

So far we have looked at absolute cell referencing. An absolute reference uses the column letter and row number. There is also relative cell referencing (usually a bad idea), which refers to a cell by the number of rows and columns it lies up and to the left of the current cell.

Often we need to copy formulas over some range. This is easily done by left-clicking on the square in the lower right-hand corner of the highlighted cell and dragging downward. In this example, cell B1 contains the formula:

= SUM( $A$1:A1 )

After copying this formula down, cell B8 contains:

= SUM( $A$1:A8 )

Notice that the first cell in the range reference is locked using the dollar signs, $A$1. This means that as you copy formulas to adjacent cells, neither the column nor the row in the first reference will change. Pressing F4 iterates through the four possible combinations of fixed cell references: A1, no lock; $A1, only column locked; A$1, only row locked; and $A$1, both column and row locked.

Sometimes we like to use named ranges (which I think are usually a bad idea). Rather than using cell references, we can use the range name as a parameter instead. For example, we can name cell A1 as Rate in the name box. Then, in cell B2 we can use this value as a parameter thusly:

= EXP( Rate )

For example, take present value (PV), which is the value today of a series of cash flows received in the future:

Present Value = CashFlow / ( 1 + Rate )^time

The cash flows (CF) in the future are less valuable than the same amount of cash today, and the discount rate adjusts for the riskiness of the future cash flow. The higher the risk, the higher the discount rate. You can always do your own math in Excel using operators to implement a formula. For example, the present value equation can be implemented as follows:

= 100 / ( 1.08 ) ^ .25

The net present value (NPV) is the present value minus the cost of acquiring the asset, which is essentially just a negative cash flow at t = 0. Excel has an NPV function built in. Be careful! Never assume Excel calculates something the way you think it does.

Net Present Value = -CF0 + Σ( t = 1 to n ) CFt / ( 1 + r )^t

In this example, the investment is worth taking if and only if the initial investment is less than the present value of the future cash flows. That is, NPV > 0 is the decision rule.

1.3.2 Working with Data

Often we use Excel to store data. This is convenient especially in light of Excel's look-up functionality, which searches for a value in a data table based upon a condition. Here is a simple example borrowed from Investopedia [See "Microsoft Excel Features For The Financially Literate," by Barry Nielsen, CFA.]:

   A              B     C  D          E          F
1  U.S. Treasury  Rate     Bond       Benchmark  Benchmark Yield
2
3  2 Year         3.04     XYZ Corp.  30 Year    4.63
4  5 Year         3.86     ABC Inc.   2 Year     3.04
5  10 Year        4.14     PDQ & Co.  5 Year     3.86
6  30 Year        4.63     MNO, Ltd.  10 Year    4.14

In this example, cell F3 contains the following formula:

= VLOOKUP( E3, $A$3:$B$6, 2, FALSE )

In the VLOOKUP function call (V stands for vertical; there is also HLOOKUP for horizontal), E3 is the look-up condition and A3:B6 is the table range. The 2 indicates to compare the look-up condition against column one of the data table and return the corresponding value from column two. TRUE as the last parameter returns an exact or approximate match; FALSE returns only an exact match. For approximate matches, the values in the data table must be in ascending order, or it may not work right.
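When this kind of look-up table moves to Python, a pandas merge does the job of VLOOKUP. The following is a minimal sketch using the same benchmark data, not code from the text; the dataframe and column names are illustrative.

import pandas as pd

# The look-up table (columns A:B) and the bond list (columns D:E)
rates = pd.DataFrame( { 'Maturity': [ '2 Year', '5 Year', '10 Year', '30 Year' ],
                        'Rate': [ 3.04, 3.86, 4.14, 4.63 ] } )
bonds = pd.DataFrame( { 'Bond': [ 'XYZ Corp.', 'ABC Inc.', 'PDQ & Co.', 'MNO, Ltd.' ],
                        'Benchmark': [ '30 Year', '2 Year', '5 Year', '10 Year' ] } )

# An exact-match VLOOKUP corresponds to a left merge on the look-up condition
bonds = pd.merge( bonds, rates, left_on = 'Benchmark', right_on = 'Maturity', how = 'left' )
print( bonds[ [ 'Bond', 'Benchmark', 'Rate' ] ] )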
1.3.3 Working with Dates and Times

If we enter a date into Excel, say 01/09/2011, we can format its appearance in several ways by right-clicking and selecting Format Cells… However, Excel itself keeps track of the date as an integer value. We can use Excel's built-in date functions to perform date calculations. For example, to find the amount of time between two dates:

   A
1  1/9/2011
2  7/18/2011
3  190
4  0.5205

Cell A3 contains the number of days between the two dates using either:

= A2 - A1
= DATEDIF( A1, A2, "d" )

The formula in cell A4 is:

= YEARFRAC( A1, A2, 3 )

For information on the parameters of the YEARFRAC function, see the Excel help files.

1.3.4 Other Functionality

You should also familiarize yourself with other capabilities of the Excel visual development environment. Most often, there is more than one way to accomplish any particular task. And, many times there are wizards and visual cues to walk you through development. For example, some of the following are also available by highlighting an Excel range and right-clicking.

What             How                              Description
Saving Files     File menu or toolbar             Opening and saving Excel files
Copy / Paste     Edit menu or toolbar             Editing cells and worksheets
Cell Formatting  Format menu or toolbar           Changing cell appearance, also setting decimal style
Excel Options    Tools | Options                  Changing default Excel settings
Charts           Chart Wizard, Insert | Charts    Creating charts in Excel
Sorting Data     Data | Sort or icon              Sorting data ascending or descending
Auditing         Formulas | Trace…                Trace cells that are precedents and dependents
VBA Editor       Developer | Visual Basic Editor  Launch the VBA development environment

We will look at more Excel functionality over the course of this study guide.

1.3.5 Getting Price Data from Yahoo! Finance

Historical stock price data is available for free on Yahoo! Finance. To get data:

- Go to Yahoo! Finance.
- Type in the symbol IBM.
- Click on Historical Prices.
- Select a date range, say Jan 1, 2023 to Dec 31, 2023. Check Daily data. Click Get Prices.
- At the bottom, click Download to Spreadsheet and save to IBM.csv.

This .csv file will open automatically in Excel.

1.3.6 Histograms in Excel

A histogram is a bar chart that shows the frequency distribution of a data set. That is, the height of each bar in a histogram is determined by the number of data points in the data set that fall into each bin. The total area of the histogram is the number of data points. In Excel, one way to calculate the width of the bins, particularly for continuous data or discrete data with a large number of possible outcomes, is:

bin size = ( max( x ) - min( x ) ) / desired number of bins

Usually, for continuous data or data where there is a large number of possible discrete outcomes, I find that between 20 and 30 bins works out well. Consider the following random data:

   A  B  C  D  E
1  6  7  5  9  8
2  8  8  3  0  3
3  1  7  6  5  3
4  6  2  3  4  3
5  6  8  8  8  1

And the following discrete bins:

    F
1   Bins
2   0
3   1
4   2
5   3
6   4
7   5
8   6
9   7
10  8
11  9

Click Data | Data Analysis | Histogram and populate the histogram window. The output shows the frequency and cumulative percentage distributions.

    G     H          I
1   Bin   Frequency  Cumulative %
2   0     1          4.00%
3   1     2          12.00%
4   2     1          16.00%
5   3     5          36.00%
6   4     1          40.00%
7   5     2          48.00%
8   6     4          64.00%
9   7     2          72.00%
10  8     6          96.00%
11  9     1          100.00%
12  More  0          100.00%

Notice in the chart output that the cumulative percentage (F( x )) has a range of 0 to 1.
[Chart: histogram of the frequencies by bin, with the cumulative % plotted as a line against a secondary axis.]

If we divide the frequency of each outcome by the total count of (in this case) n = 25, we get the percentage that each outcome in the sample represents. The sum of these percentages, or what will become probabilities, is equal to one. We end up with a probability distribution. We often try to fit a probability density function, f( x ), to the data in a probability distribution. From there, the cumulative distribution function, F( x ), can be estimated through summation of the relative frequencies.

1.4 Descriptive Statistics

Recall that in a probability distribution, the first raw moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. Consider the following example spreadsheet data.

   A   B
1   1   2
2   2  -1
3  -2  -2
4   0   1
5  -1   0
6   2   2

Measures of Central Tendency:

Mean: The mathematical average, which is the sum of all the values divided by the number of values.
Median: The value that lies in the middle of all values after sorting low to high.
Mode: The most commonly occurring value.

= AVERAGE(A1:A6)
= MEDIAN(A1:A6)
= MODE(A1:A6)

Measures of Dispersion (or Width):

Range: The difference between the max value and the min.
Quantile: Also called percentile, and often defined as quartiles (fourths), quintiles (fifths), deciles (tenths), etc. The value below which some percentage of the values falls. The interquartile range (IQR) is the difference between the third quartile and the first quartile, so that 50% of the values lie in the range.
Variance: The average of the squared deviations of each value from the mean.
Standard Deviation: The square root of the variance. It is used in finance as a proxy for volatility or risk.

= MAX(A1:A6) - MIN(A1:A6)    ' Find the range
= PERCENTILE( A1:A6, 0.5 )   ' Value at the 50th percentile
= VAR.S(A1:A6)               ' Calculate the sample variance
= STDEV.P(A1:A6)             ' Calculate the population standard deviation

Measures of Shape:

Skew: A measure of the asymmetry in the distribution, either left or right. Usually, if median > mean, the distribution is negatively skewed; if median < mean, the distribution is positively skewed. This is not always the case, but usually.
Kurtosis: A measure of the weight of the tails of the distribution relative to the mean. The standard normal distribution has kurtosis of 3. We are often interested in the excess kurtosis, relative to that of the standard normal.

= SKEW.P(A1:A6)
= KURT(A1:A6)

As a finance example, consider the compound interest equation A = P( 1 + r/n )^( n·t ), where n is the number of times the interest compounds per year. As n "goes to infinity", or n → ∞:

e^( r·t ) = lim( n → ∞ ) ( 1 + r/n )^( n·t )

Then the continuous compound interest formula that we use for stocks is:

St = S0 · e^( r·t )

Solving for r, the log return (where t usually is 1), gives us:

r = ln( St / S0 )

If we make a histogram of log returns over time, it will look something like a bell curve. The reality, however, is that histograms of log returns in financial markets are leptokurtic, with fatter tails than the normal distribution.
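As a quick numerical check of this derivation (an illustrative sketch, not from the text), the code below compounds $100 at 8% with increasing frequency n and compares against the continuous formula; it also confirms that the log return recovers r.

import numpy as np

P, r, t = 100.0, 0.08, 1.0

# ( 1 + r/n )^( n*t ) approaches e^( r*t ) as n grows
for n in [ 1, 12, 365, 100000 ]:
    print( n, P * ( 1 + r / n ) ** ( n * t ) )
print( 'continuous:', P * np.exp( r * t ) )

# The log return recovers the continuously compounded rate
S0 = 100.0
St = S0 * np.exp( r * t )
print( 'log return:', np.log( St / S0 ) )  # prints 0.08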
Measures of Dependence:

Covariance: A measure of the degree to which two random variables (i.e. returns on two assets) tend to deviate from their own means in tandem.
Correlation: Pearson's correlation is a linear measure of the degree to which two variables tend to move together.

= COVARIANCE.P(A1:A6,B1:B6)
= CORREL(A1:A6,B1:B6)

Spearman's Rank Correlation: A non-parametric measure of statistical dependence between two random variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

A simple procedure is normally used to calculate Spearman's correlation. The n raw scores are converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation on the two variables are calculated. If there are no tied ranks, then ρ is given by:

ρ = 1 − ( 6 · Σ di² ) / ( n · ( n² − 1 ) )

= RANK.AVG(A1,$A$1:$A$6,0)   ' C1 to C6
= RANK.AVG(B1,$B$1:$B$6,0)   ' D1 to D6
= C1-D1                      ' E1 to E6
= E1^2                       ' F1 to F6

   A    B    C    D    E     F
1   1    2   3    1.5  1.5   2.25
2   2   -1   1.5  5   -3.5   12.25
3  -2   -2   6    6    0     0
4   0    1   4    3    1     1
5  -1    0   5    4    1     1
6   2    2   1.5  1.5  0     0
7                      Sum:  16.5

In the data in both columns A and B, 2 is repeated. In columns C and D, the RANK.AVG function assigns the average rank of 1.5 to both. Then, Spearman's rank correlation is:

ρ = 1 − ( 6 · 16.5 ) / ( 6 · ( 36 − 1 ) ) = 0.5286
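Because columns A and B contain tied ranks, it is worth checking the hand calculation in Python. This is an illustrative sketch, not code from the text: scipy.stats.spearmanr applies the full tie correction, so its answer (about 0.51) differs slightly from the d²-shortcut value of 0.5286 above.

import numpy as np
from scipy import stats

a = np.array( [ 1, 2, -2, 0, -1, 2 ] )
b = np.array( [ 2, -1, -2, 1, 0, 2 ] )

# spearmanr ranks the data, applies the tie correction, and returns rho and a p-value
rho, p_value = stats.spearmanr( a, b )
print( rho )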
LAB 1.1: Descriptive Statistics of Stock Returns

Get price data for Microsoft (MSFT) and Intel (INTC), 1/1/2015 through 12/31/2023, from Yahoo! Finance.

- Calculate the daily log returns for each stock, where the log return for day t is the natural log of the closing stock price S on day t divided by the closing stock price the day before, t − 1:

  rt = ln( St / St−1 )

- Calculate the average daily return for each.
- Calculate the total return for each stock (hint: log returns are additive).
- Calculate the daily variances and standard deviations of returns.
- Calculate the annualized volatility of each stock (hint: variance grows linearly with the number of days; standard deviation grows with the square root of the number of days).
- Calculate the covariance and correlation of returns between the two stocks.
- Create a line chart showing the prices of each stock.
- Create a histogram of the returns.

How do you know your answers are right?

CHAPTER 2 INTRODUCTION TO PYTHON

2.1 Getting Started with Python

For this class we will be using:

- Anaconda Navigator 3.2
- Spyder Editor 5.5.1
- Python 3.12

You can install the Anaconda Individual Edition for free. There are 3 ways we can write Python code:

1. In the Console 1/A window, we can write command-line code.
2. In the editor window (temp.py or untitled.py), we can write programs.
3. In the editor window, we can also write scripts.

In this class, we will be writing scripts that encapsulate financial models. Before we run a script, it's good practice to:

- On the menu bar, click Run | Configuration per file…
- Then, check "Remove all variables before execution"

Save the file, and then run the program by clicking the Run button. The output of the script will show in the Console 1/A window. So, now let's take a look at writing Python code.

- Try not to get frustrated. Sometimes every line of Python code can feel like a research project.
- Google is your friend. There are millions of pages of help files and sample Python code on the internet. Getting good at getting help is a big part of being a good Python programmer.
- Simplicity and clarity are virtues. You write code for three audiences:
  1. The compiler / interpreter, which won't compile / run incorrect code.
  2. Yourself, when you try to read your own code later.
  3. Your customer, who wants to understand how your code works.
  None of these is going to be impressed by complex code that shows how smart you are (or think you are).
- Practice. Do not try to memorize code. The goal is not to know code. The goal is to use Python code to finish projects successfully. We all learn by hacking. So, hack away.
- There is no one right answer. For any coding project, there are an infinite number of ways to satisfy the requirements successfully in code. They can all be correct. Once you become more accomplished, you will learn that some solutions may be better than others: fewer lines of code, faster execution, clearer logic. In this class, we won't worry about these.

You can find The Python Tutorial here: docs.python.org/3/tutorial/ However, there are many, many other websites that have Python code examples and explanations. By and large, don't waste your time reading about code. You learn to code by coding. If your fingers aren't moving, you're doing something wrong. Find code examples. Type them in. Get them to work. Repeat.

There are thousands of open-source, third-party packages for Python. Some of these packages have types and functionality important for financial modeling. It is important to note that there are many overlaps between packages, and therefore, many ways to accomplish the same task. Of course, we will not be able to look at all the ways to do things. We will stick with the well-known packages. There are many, many packages that are wrong, or broken, or out of date. Some important packages included in Anaconda that we will be using are: numpy, for its array object; scipy, which is a collection of mathematical algorithms; matplotlib, which is used to produce graphs and figures; and pandas, which enables data analysis in Python. At times we may need to install new packages.

REMEMBER: GOOGLE IS YOUR FRIEND. We will also be using other types, packages, modules, and functions. You should Google them to investigate!

PYTHON IS A GENERAL PURPOSE LANGUAGE. It is relatively easy to write. It is very hard to read! Simplicity and clarity are virtues!

REMEMBER: CLICK THIS "ON": Run | Configuration per file… | Remove all variables before execution

2.2 Memory Management

Python uses objects, which include simple variables (integers, strings, Booleans), instances of classes, and data structures (lists, numpy arrays, dictionaries). Be aware that Python is:

- Dynamically-typed, which means that every object name is bound at run-time to a type, unless it's null. Types are not checked at design time or by a compiler.
- Strongly-typed, which means that objects cannot be coerced into unrelated types implicitly, say from "9" (string) to 9 (int). You need an explicit conversion to do that.
- Implicitly-typed, which means it uses type inferencing to determine types rather than explicit declarations. This will drive you crazy.

Consider the simple line of code a = 7 in Python. Here the thing "a" itself is not an object. Rather, it's a reference (i.e. a pointer, if you like C/C++). All Python references are generic. They are all the same. What distinguishes one reference from another is the type of the thing it refers (or points) to, or (we say) is bound to, at any particular time.
A reference holds the address of a dynamically allocated object of some type. And, the thing it refers to can change.

a = 7
print( type( a ) )

a = "Hello"
print( type( a ) )

b = a
a = False
print( a )
print( b )

In this case, the reference "a" refers to (or is bound to) 7, which is of type integer. Then, a = "Hello", which is a string. Then, a = False, a Boolean. Nevertheless, when a reference is bound to a type, for ease of talking about things, we will simply say "a is an integer," when really what we mean is "a refers to an integer."

In Python, the most commonly used data types are:

Example Python Code                    Inferred Data Type
x = "Hello World"                      string
x = 20                                 int
x = 20.5                               float
x = 1j                                 complex
x = [ "apple", "banana", "cherry" ]    list
x = ( "apple", "banana", "cherry" )    tuple
x = range( 6 )                         range
x = { "name": "John", "age": 36 }      dictionary
x = { "apple", "banana", "cherry" }    set
x = True                               bool

When an object is no longer being referred to, it will be deleted, or deallocated. This happens through a process called garbage collection, which works through reference counts. Whenever you create an object in Python, it has a reference count. An object's reference count gets incremented whenever it is referenced, and decremented when it is dereferenced. If the reference count gets to 0, the object is deallocated.

2.1.1 Simple Syntax Example

There are millions of web pages that give examples in Python. You should get used to the ones you like.

# import the math package
import math

# the math package contains many functions such as
# pow, exp, log (for natural log), sqrt, pi, e

# define a function
# x and y are the parameters, and y has a default value
def add( x, y = 5 ):
    # the body of the function is determined by indents
    return x + y

# create a variable and use the math package
# x is implicitly typed
x = math.pow( 2 + 4, 2 )

# print x
print( x )

# call the function and print the return value
print( add( 3 ) )

2.1.2 Control Structures Code Example

Let's take a look at the if…else, for loop, and while loop syntaxes in Python. Add the following code to that above.

a = 5
b = 2

# if...else statement syntax
print( "\nIF...ELSE:\n" )   # \n is the line return character

if a >= 5:
    print( "\tTRUE!" )
else:
    print( "\tFALSE!" )

# for loop syntax
print( "\nFOR LOOP:\n" )

for i in range( 1, 10 ):
    print( i )

# while loop syntax
print( "\nWHILE LOOP:\n" )

while a < 10:
    print( "\t" + str( a ) )   # \t is the tab character
    a, b = b, a + b

2.2 Python Data Structures

In this class we will look at the relevant in-memory data structures in Python and the ways we will be downloading data, reading data files, and accessing various data sources. In order to manipulate data in Python, you must understand how data is represented in memory. Under the hood of Python is C++. In C++ an array is a set of contiguous memory locations. Assume that we have an array of integers { 7, 2, 6, 8, 1, 3 }, where the size of an integer in memory is 4 bytes. Each element in an array has three things: an address, an index, and a value.

Notice that a 1D array has elements addressed 4 bytes apart. If we reshape this array into 2D with 2 rows and 3 columns, the first dimension { 10, 22 } holds the address of the first element in each row. However, the underlying data has not changed addresses. It's really just a matter of perception.
Likewise, if we reshape this array into 2D with 3 rows and 2 columns, where the first dimension is now { 10, 18, 26 }, the data remains the same. Were we to ravel this 2D array, we would see that the flattened array still holds the data in 1D. Try this simple script, which shows how this works using a numpy array.

import numpy as np

a = np.array( [ 7, 2, 6, 8, 1, 3 ] )
print( a )
print( a.shape )

# To use this array as a column vector
a = a.reshape( 6, 1 )
print( a )
print( a.shape )

# To use this array as a row vector
a = a.reshape( 1, 6 )
print( a )
print( a.shape )

# We can reshape this into 2D
a = a.reshape( 2, 3 )
print( a )
print( a.shape )

# Or, alternatively in 2D
a = a.reshape( 3, 2 )
print( a )
print( a.shape )

# The raw data has not changed
print( a.ravel() )

There are five data structures we will be using most often: lists, dictionaries, tuples, numpy arrays, and pandas dataframes. (While Python has a standard array type, we will not be using it. We always use numpy arrays.) Here is a very brief review of these five.

2.2.1 Lists [ ]

Lists can contain elements of different types, though very often we use them to hold arrays of data. The notation is a little unsettling at times. Try these out.

a = [ 3, 7, 2, 6, 4 ]
print( a[ :2 ] )
print( a[ 3: ] )
print( a[ 2:4 ] )

b = [ [ 4, 9, 2 ], [ 6, 5, 1 ], [ 8, 3, 7 ] ]
print( b[ 1 ] )
print( b[ :2 ] )
print( b[ 1: ] )

2.2.2 Numpy Arrays

Numpy's main object is the homogeneous, multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In numpy, dimensions are called axes. For example, the coordinates of a point in 3D space, [1, 2, 1], form a 1D array. It has one axis, and that axis has 3 elements in it, so we say it has a length of 3. The following 2-axis array has a first axis (rows) of length 2 and a second axis (columns) of length 3:

[[1., 0., 0.],
 [0., 1., 2.]]

Numpy also provides familiar math functions such as sin, cos, and exp. Numpy arrays support vectorized operations, such as element-by-element addition, so we don't have to write for loops. In other words, numpy arrays are fast. We can define the data type for an array, but it can also be inferred, always as the minimum type required. It only takes one floating point number to require a float type array.

import numpy as np

# b is an integer array from a list
b = np.array( [ 2, 4, 6 ] )

# Multiply b by 2
b *= 2
print( b )
print( b[ 1 ] )
print( b[ :-1 ] )
print( b[ :2 ] )
print( b[ 1: ] )
print( b )
print( sum( b ) )
print( b[ 1: ] - b[ :-1 ] )

Now, append the following.

# c is a float array from a list
c = np.array( [ 3, 5.3, 9 ] )

d = b + c
print( type( d ) )
print( d )

How about 2D arrays? Append the following.

# e is an array of 51 floats from 0 to 10
e = np.linspace( 0, 10, 51 )
print( e )

# f is an array of 5 zeros
f = np.zeros( 5 )
print( f )

# g is an array of floats
g = b * c
print( g )

Now try this.

# h is a 2D integer array from a 2D list
h = np.array( [ [ 1, 4, 5, 8 ], [ 9, 7, 4, 3 ] ] )

# i is a 2D array of ones as floats
i = np.ones( ( 3, 4 ), dtype = float )
print( i )

# j is a 4x4 identity matrix
j = np.eye( 4 )
print( j )

# k is h transposed
k = h.T
print( k )

What about matrix multiplication?

# WARNING: Do not use * to do matrix multiplication
# Use the dot() function
l = np.dot( h, j )
print( l )
print( l.shape )

Here are some helpful techniques.
# list comprehensions loop over arrays with this syntax
m = sum( [ x ** 2 for x in b ] )
print( m )

# apply the natural log function to every element in an array
n = np.log( b )
print( n )

2.2.3 Dictionaries { }

A dictionary is an unordered collection of objects (values), "indexed" by unique identifiers (keys), which can be strings or numbers. We say that each element in a dictionary is a key-and-value pair.

data = { "IBM":[ 100, 101, 97, 99 ], "MSFT":[ 55, 59, 54, 56 ] }
print( type( data ) )
print( data.keys() )
print( data.values() )
print( data[ "IBM" ] )

2.2.4 Tuples

A tuple consists of immutable values, possibly of different types, separated by commas. However, while the tuple itself cannot be changed, it may contain mutable elements.

t = 'IBM', 145.34, 14500
print( t[ 1 ] )

v = ( [ 1, 2, 3 ], [ 4, 5, 6 ] )
v[ 1 ][ 0 ] = 'X'
print( v )

2.2.5 Pandas Dataframe

A pandas dataframe is a two-dimensional, size-mutable table (i.e. rows and columns, or records and fields), where the columns can be of different data types. Let's create a dataframe from the ground up. Suppose we want to manage time series stock price data (symbol, date, closing price, and volume) in memory. (Don't worry about the actual data for now.) Notice the various data types listed in the output of this program.

import pandas as pd
import numpy as np

# Create some lists
symbol = [ 'IBM' ] * 4
close = [ 100., 101., 99., 100. ]
volume = [ 100 ] * 2 + [ 200 ] * 2
date = pd.date_range( '20170103', periods = 4 )

# Create a dictionary
bardata = { 'Symbol':symbol, 'Date':date, 'Close':close, 'Volume':volume }

# Convert the dictionary to a dataframe
df = pd.DataFrame( bardata )

print( '\nDATAFRAME:\n' )
print( df )

print( '\nTYPES IN THE DATAFRAME:\n' )
print( df.dtypes )

# Get a column from a dataframe
vs = df[ 'Close' ]
print( '\nCLOSE COLUMN:\n' )
print( vs )

# Add a new column to the dataframe with the log returns
df[ 'Return' ] = np.insert( np.diff( np.log( df[ 'Close' ].tolist() ) ), 0, [ 0 ] )
print( '\nDATAFRAME WITH NEW COLUMN:\n' )
print( df )

2.3 Python Data Sources

You should know how to access data in Python from:

• a Comma Separated Value (.csv) file
• an Excel (.xlsx) file
• a Numpy (.npy) file
• a Pickle (.pkl) file
• the Pandas DataReader

2.3.1 Reading .csv and .xlsx Files

You can read .csv datafiles into dataframes easily. Download some historical data from Yahoo! Finance and read it into a dataframe.

import pandas as pd

# Read a .csv file into a dataframe
df = pd.read_csv( 'C:\\Python\\IBM.csv', sep = ',' )
print( df.head( 5 ) )

# Write a .csv file from a dataframe
df.to_csv( 'C:\\Python\\IBM_backup.csv', sep = ',', index = False )

There are other ways to read .csv files, but reading directly into a dataframe is the simplest method. Sometimes, however, there may be too much data to load into memory. What if the data is a million rows? In this case we forward read the data, one row at a time.

import csv

with open( 'C:\\Python\\IBM.csv' ) as csvDataFile:
    csvReader = csv.reader( csvDataFile )
    for row in csvReader:
        print( row )

We can also read and write to Excel files.
# Reading an Excel file into a dataframe
# There is only one worksheet in the Excel file
import pandas as pd

df = pd.read_excel( 'C:\\Python\\IBM.xlsx', sheet_name = 'Sheet1' )
print( df.head( 5 ) )
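Writing is symmetric. Here is a minimal sketch; the backup filename is just an illustration, and writing .xlsx files requires the openpyxl package to be installed.

# Write a dataframe to an Excel file
df.to_excel( 'C:\\Python\\IBM_backup.xlsx', sheet_name = 'Sheet1', index = False )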
LAB 2.2: Descriptive Statistics in Python

Using the returns for Microsoft MSFT and Intel INTC from LAB 1.1, in Python:

• Calculate the daily returns for each stock using the continuous compounding formula.
• Calculate the average daily return for each.
• Calculate the daily variances and standard deviations of returns.
• Calculate the covariance and correlation of returns between the two stocks.

How do you know your answers are right?

import numpy as np
import pandas as pd

# read the MSFT data into a dataframe
df1 = pd.read_csv( 'C:\\Python\\MSFT.csv' )

# add a column to the dataframe and calculate the log returns
df1[ 'log_ret' ] = np.log( df1[ 'Adj Close' ] / df1[ 'Adj Close' ].shift( 1 ) )

# print the descriptive statistics
# numerical values may need to be converted to strings for printing using str()
print( "\nMSFT Statistics:\n" )
print( "MEAN:\t" + str( df1[ 'log_ret' ].mean() ) )
print( "VAR:\t" + str( df1[ 'log_ret' ].var( ddof = 0 ) ) )
print( "STDEV:\t" + str( df1[ 'log_ret' ].std( ddof = 0 ) ) )
print( "SKEW:\t" + str( df1[ 'log_ret' ].skew() ) )
print( "KURT:\t" + str( df1[ 'log_ret' ].kurt() ) )

rets1 = np.asarray( df1[ 'log_ret' ].dropna() )

# read the INTC data into another dataframe and calculate the log returns
df2 = pd.read_csv( 'C:\\Python\\INTC.csv' )
df2[ 'log_ret' ] = np.log( df2[ 'Adj Close' ] / df2[ 'Adj Close' ].shift( 1 ) )

# convert the dataframe column to a numpy array
rets2 = np.asarray( df2[ 'log_ret' ].dropna() )

# combine the two arrays of returns into a single, 2D array
rets_all = np.stack( ( rets1, rets2 ), axis = 0 )

# print out the covariance and correlation matrices
print( "\nCOVAR Matrix:\n" )
print( np.cov( rets_all, ddof = 0 ) )
print( "\nCORREL Matrix:\n" )
print( np.corrcoef( rets_all ) )

LAB 2.4: Histograms in Python

Consider again the same data from the histogram example in Chapter 1.

import numpy as np
import matplotlib.pyplot as plt

data = np.array( [ 6, 7, 5, 9, 8, 8, 8, 3, 0, 3, 1, 7, 6, 5, 3, 6, 2, 3, 4, 3, 6, 8, 8, 8, 1 ] )

# plot the histogram of the probability distribution,
# along with the cumulative distribution as a step line
plt.hist( data, density = True, bins = 10 )
plt.hist( data, density = True, bins = 10, histtype = 'step', cumulative = True )
plt.show()

2.3.2 Writing and Reading Numpy Files

Numpy includes a simple file format for numpy arrays. Numpy files (.npy) store the data, shape, dtype, and other information required to reconstruct the array.

import numpy as np

a = np.array( [ 1, 2, 3, 4, 5 ] )
np.save( 'C:\\Python\\np_test', a )

b = np.load( 'C:\\Python\\np_test.npy' )
print( b )

2.3.3 Working with Pickle Files

Pickling is the Python process that serializes objects or data structures into byte streams, which can be saved as files or sent over a network. Unpickling is the reverse, where a byte stream is deserialized into an object or data structure. The first three pickling protocols are: 0) the ASCII protocol; 1) the old binary format; and 2) a more efficient binary format. HIGHEST_PROTOCOL always ensures use of the highest available version. When we create pickle files, we will use the .pkl extension.

import pickle
from datetime import date
import numpy as np

data = { 'IBM': [ date( 2018, 1, 26 ), 1.45, 'AA-' ],
         'MSFT': [ date( 2018, 6, 14 ), 0.98, 'AA' ],
         'GM': [ np.nan, 0.0, 'BB+' ] }

with open( 'C:\\Python\\data.pkl', 'wb' ) as outfile:
    pickle.dump( data, outfile, pickle.HIGHEST_PROTOCOL )

with open( 'C:\\Python\\data.pkl', 'rb' ) as infile:
    data = pickle.load( infile )

print( data )

2.3.4 Working with Pandas Datareader

The pandas datareader can automatically populate a dataframe from various internet data sources. The list of data sources can be found here:

https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

Be aware, however, that not all the data sources actually work for free. Some may require additional licensing or subscription fees to use them. For example, Yahoo!'s API no longer permits free downloads. This example accesses the quandl data source for historical price data.

import pandas as pd
# a compatibility workaround for older versions of pandas_datareader
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data as dp
import datetime

start1 = datetime.datetime( 2017, 1, 1 )
end1 = datetime.datetime( 2020, 1, 27 )
df = dp.DataReader( 'IBM', 'quandl', start = start1, end = end1 )
print( df )

This example accesses the St. Louis FED (FRED) for historical housing data.

import pandas_datareader.data as data
import datetime
import numpy as np
import matplotlib.pyplot as plt

start = datetime.datetime( 2010, 1, 1 )
end = datetime.datetime( 2020, 1, 27 )
df = data.DataReader( 'HOUSTNSA', 'fred', start, end )
print( df )

arr = np.array( df )
plt.plot( arr )
plt.show()

2.3.5 Installing a New Package

Step 1: In your Windows Start menu, open the Anaconda Prompt. Typically, we want to use the conda install tool to install new packages. The conda install string for the linearmodels package is:

conda install -c conda-forge linearmodels

However, this didn't work for me, so I used pip install instead.

Step 2: At the command prompt, type the pip install string:

pip install linearmodels

Step 3: Check that the linearmodels package installed correctly by importing it in a Spyder Python program.

CHAPTER 3 FINANCIAL DATA

"The data is always the problem."

3.1 Types of Databases in Finance

We won't look at all of them in this text, but there are at least seven types of data we use in finance:

• Price Data consists of the bid and ask prices and quantities, and trade prices and quantities, for securities and derivatives.
• Valuation Data is different from price data. For some financial instruments (bonds, swaps, and all OTC derivatives) no price data exists, or if it does, it is a highly guarded secret. For these, valuation data is all there is. That is, the price exists only in theory, and, furthermore, is not a firm bid or offer that any market maker is obliged to honor.
• Fundamental Data consists of everything that is disclosed in 10-Q quarterly and 10-K annual reports, including key business items, such as earnings, sales, inventories, and rents.
• Calculated Data is data that is calculated from fundamental data, such as ROE, price to book, beta, forecasted dividends, free cash flow, etc. Index values, such as the Dow Jones and S&P 500, are calculated and broadcast in real-time.
• Economic Data, such as CPI and GDP, consists of key indicators often used in financial analysis and trading.
• Unstructured Data consists of things like news articles, pictures, Twitter feeds, and product reviews.
• Sentiment Data quantifies the emotional content embedded in unstructured data.
3.2 Data Dictionary

As we have seen, we will be accessing data in different file formats. Regardless of the format, however, the data almost always comes in tables consisting of rows (or records) and columns (or fields). It is very important to create a data dictionary for every data set. A data dictionary contains the field names in a data set, along with their data types and descriptions. Here is an example of a data dictionary for a simple table of stock positions.

Field Name     Data Type   Description                                      Valid Values
symbol         string      The symbol of the financial instrument.
industry       string      The industry abbreviation.                       OIL, CHEM, IT, TRANS
quantity       int         The number of shares owned.
entry_price    float       The price the stock was bought at.
profit_dollar  float       The profit or loss since the stock was bought.
winlose        char        Whether the trade is a winner or a loser.        W, L

3.3 Data Cleaning

Good inputs are the key to success in financial modeling. Forecasting, as well as risk management, requires good, clean data for successful testing and simulation. Virtually no data, however, is perfect, and financial engineers spend large amounts of time cleaning errors and resolving issues in data sets. It is very easy and very common to underestimate the amount of time preprocessing will take.

First, identify and categorize all the types of problems you expect to encounter in your data; then survey the available techniques to address those different types of errors; and finally develop methods to identify and resolve the problems. Data problems fall into one of four categories:

• Bad, or incorrect, data.
• Formatting problems.
• Outliers, which skew results.
• Point-in-time data problems.

Type of Bad Data           Example
Bad values                 Tick of 23.54, should be 83.54
Missing data               Blank field, or data coded as "9999," "NA," or "0"
Bad dates                  2/14/12997
Column-shift data          Value printed in an adjacent column
File corruption            Network errors
Different data formats     Data from different vendors may come in different formats or table schemas
Missing fundamental data   The company may have changed the release cycle

Here are some things to look for algorithmically when attempting to clean a data set (a few of these scans are sketched in code below):

• Intra-period high less than closing price
• Intra-period low greater than opening price
• Volume less than zero
• Bars with wide high/low ranges relative to some previous time period
• Closing deviance: divide the absolute value of the difference between each closing price and the previous closing price by the average of the preceding 20 absolute values
• Data falling on weekends or holidays
• Data with out-of-order dates or duplicate bars
• Price or volume greater than four standard deviations from the rolling mean
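Here is a minimal sketch of three of these scans in pandas. It assumes a Yahoo!-style .csv with columns named Open, High, Low, Close, and Volume, and the deviance threshold of 4 is an illustrative choice, not a rule.

import pandas as pd

df = pd.read_csv( 'C:\\Python\\IBM.csv' )

# intra-period high less than closing price
bad_high = df[ df[ 'High' ] < df[ 'Close' ] ]

# volume less than zero
bad_volume = df[ df[ 'Volume' ] < 0 ]

# closing deviance: |close - prior close| over the average of the preceding 20 absolute changes
abs_chg = ( df[ 'Close' ] - df[ 'Close' ].shift( 1 ) ).abs()
deviance = abs_chg / abs_chg.shift( 1 ).rolling( 20 ).mean()
suspicious = df[ deviance > 4 ]

print( len( bad_high ), len( bad_volume ), len( suspicious ) )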
3.3.1 Identify Required Cleaning Activities and Algorithms

All data, both real-time and historical, should be assumed to contain errors and issues. For example, data issues for high frequency systems will center more on clean tick data, whereas those for systems with longer-term holding periods will focus more on, say, dividends and releases of, and revisions to, financial statements.

3.3.2 Bad Data and Outliers

In all cases, cleaning bad data is a process that consists first of detection, then classification of the root cause of the error, and then correction of the error. Whatever methods you use to accomplish these must be shown to operate on both historical and real-time data. This matters because data cleaning algorithms can add latency to real-time systems. Algorithms that cannot be performed in real time prior to trade selection should not be used on historical data.

Since many models are sensitive to even a few bad data points, I recommend looking carefully at the means, medians, standard deviations, histograms, and minimum and maximum values of time series data. A good way to do this is to sort or graph the data to highlight values outside an expected range, which may be good (but outlying) or bad data. Outliers may be good, or real, data points, but they may obliterate the underlying structure, or information, otherwise present in the data.

Brownlees and Gallo (2005) propose the following: construct a z-score heuristic, and then eliminate (or "trim") observations if the absolute value of the score is beyond some threshold. The z-score is computed using a moving average, a moving variance, and a parameter $\gamma$, which induces a positive lower bound on the heuristic variance and is useful for handling sequences of identical prices:

$$z_i = \frac{p_i - \bar{p}_i(k)}{\sqrt{s_i^2(k) + \gamma}}$$

where $\bar{p}_i(k)$ and $s_i^2(k)$ are the moving average and moving variance of the k observations neighboring price $p_i$. Observation i is trimmed if $|z_i|$ exceeds the chosen threshold.
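Here is a minimal sketch of this trimming heuristic using pandas rolling windows. The window length k, the threshold c, and gamma are illustrative choices, and the trailing window is a simplification of the neighborhood used in the paper.

import pandas as pd
import numpy as np

def bg_trim( prices, k = 20, c = 3.0, gamma = 1e-6 ):
    # trailing moving average and variance, excluding the current observation
    avg = prices.rolling( k ).mean().shift( 1 )
    var = prices.rolling( k ).var( ddof = 0 ).shift( 1 )
    # gamma bounds the denominator away from zero for runs of identical prices
    z = ( prices - avg ) / np.sqrt( var + gamma )
    # keep every observation that is not flagged as an outlier
    return prices[ ~( z.abs() > c ) ]

prices = pd.Series( [ 100, 98, 105, 100, 138, 98, 101, 97, 104, 95 ] )
print( bg_trim( prices, k = 4 ) )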
But why not winsorize the outliers, rather than trim them? For example, set all outliers to a specified percentile of the data: any data point above the 95th percentile would be "pulled in," or set to the value at the 95th percentile.

LAB 3.1: Winsorizing in Practice

Consider the following data set:

100, 98, 105, 100, 138, 98, 101, 97, 104, 95, 54, 101, 104, 99, 100, 109, 105, 100, 95, 99

The boss is concerned that outliers in the data may ruin a forecasting model. Write a Python script that will winsorize the outliers to the 5th / 95th percentile. As discussed, do it in Excel first. Graphing the data shows us that there are outliers. Now, write the Python code so that you get the same result as you did in Excel.

import numpy as np
from scipy.stats import norm

data = np.array( [ 100, 98, 105, 100, 138, 98, 101, 97, 104, 95, 54, 101, 104, 99, 100, 109, 105, 100, 95, 99 ] )

avg = data.mean()
stdev = data.std()

# find the values at the 5th and 95th percentiles of a fitted normal distribution
pct_5 = norm.ppf( .05, loc = avg, scale = stdev )
pct_95 = norm.ppf( .95, loc = avg, scale = stdev )
pct_5 = round( pct_5 )
pct_95 = round( pct_95 )

# pull each outlier in to the nearest percentile boundary
for x in np.nditer( data, op_flags = [ 'readwrite' ] ):
    if x < pct_5:
        x[ ... ] = pct_5
    if x > pct_95:
        x[ ... ] = pct_95

print( data )

To be perfectly honest, I had to use Google four times in order to get this to work right.

Cleaning historical data is a trade-off between under- and over-cleaning. What is bad data to you may not be bad data to me. What is clean at 10-minute intervals may be dirty at 10-second intervals. There is no single, correct way to clean data that applies to all scenarios. Models built on over-cleaned historical data may have problems with real-time data that contains errors. The objective is to manage the trade-off to produce a historical data series where problems are dealt with, but where the real-time properties of the data are maintained.

3.3.3 The Point-in-Time Data Problem

Even "cleaned" data can have problems. Consider the following scenario: stock price data for a trading day is cleaned after the close of business. The next day, your data vendor sends out an adjustment file reflecting the corrected, or cleaned, price data. This occurs regularly due to block trades and other off-exchange trades being reported late, all of which become part of the day's historical price data. However, you didn't have the cleaned data while you were trading that day. If you use the cleaned data for a model, that's not the data you would have had at the time. Further, with the new, cleaned price data, many calculations for derivatives prices from the previous day, which were based on the uncleaned data, will now be wrong. Implied volatility calculations will be wrong. All the historical prices of OTC derivatives that are based on the stock's implied volatility will now be wrong. End-of-day rebalancing algorithms could also be wrong due to the uncleaned data used to run the models. So, cleaned data may not be better.

An algorithm for calculating end-of-day prices is very important, but not as easy as it sounds. At the close, block orders affect prices, traders sometimes push bids and asks around, and small orders can move the market in the absence of late liquidity. Market fragmentation (there are at least 60 execution venues for stocks in the U.S.) makes it difficult to define a single closing price for a stock. To calculate a clean closing price for a stock, it may be advisable to get a non-official closing price five minutes prior to the close across selected exchanges: throw out all locked or crossed bids and asks, and then determine the weighted mid-point price. Futures prices are generally fairly clean, since contracts trade on only one exchange. Option prices are very difficult to deal with, however, due to market fragmentation, liquidity issues, and issues with the prices of the underlyings.

3.3.4 Synchronizing Data

Databases of the different types listed above have different updating periods. Take splits, for example. Price data vendors update daily. Balance sheet data vendors may update their databases weekly. As a result, a given ratio, such as sales-to-price, may contain an unsplit sales figure and a split price. Fixing this problem is called synchronizing the data, accomplished either by buying synchronized data from a vendor or by performing the task in-house.

The real key to synchronizing data, or blending data, is a data map, sometimes called a Rosetta Stone. A Rosetta Stone is the set of unique identifiers used to link data and instruments across databases from various vendors. A proper Rosetta Stone allows the trading or investment firm to trade many instruments (stock, options, bonds, CDs, and OTC products) on a single underlying instrument. Further, developing unique identifiers across underlyings and across vendors enables the blending of proprietary data with purchased data.

3.4 Rescaling Data

3.4.1 Normalizing

The process of normalization rescales a dataset over the range [ 0, 1 ]:

$$x_{i,norm} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

While this may be useful in some cases, outliers will be lost. We use normalization when we want to put things in probability space, since probabilities range from 0 to 1. We can also normalize over some other range [ min, max ]. Given raw data in column A and its cumulative probabilities in column B, we can convert it to the new normalized score in column C. Start by sorting the raw data. Given data $x_1$ through $x_n$, where i = 1, ..., n, the normalized scores are found as:

$$z_i = min + \frac{x_i - x_{min}}{x_{max} - x_{min}} \cdot ( max - min )$$

In this next table, given the ranks of the raw data in column A, we can convert them to the new standardized score in column C. The Excel formulae for generating the data are the same as in Table 1. The difference between simple ranking and distribution fitting is that using ranks is like fitting to a uniform distribution.

Figure 1: Simple Ranking Fits to a Uniform Distribution

As can be seen from Figure 1, two ranks in the neighborhood of P( a ) will map the appropriate distance apart, as will two points in the neighborhood of P( b ), because of the constant slope of F( x ) in a uniform distribution.
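Here is a minimal sketch of both normalization formulas in numpy; the data values are made up for illustration.

import numpy as np

data = np.array( [ 3.0, 7.0, 2.0, 6.0, 4.0 ] )

# rescale to the range [ 0, 1 ]
norm01 = ( data - data.min() ) / ( data.max() - data.min() )
print( norm01 )

# rescale to some other range, say [ -1, 1 ]
lo, hi = -1.0, 1.0
print( lo + norm01 * ( hi - lo ) )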
3.4.2 Standardizing Data

More often, what we want to do is standardization, which rescales the data set to express the values as z-scores, their number of standard deviations from the mean. Thus, we have to remove the mean from the training data (i.e. center it) and scale it to unit variance:

$$z_i = \frac{x_i - \mu}{\sigma}$$

Suppose that, given raw fundamental data (e.g. earnings per share, price-to-book ratio, etc.), we wish to standardize the data by fitting it to the normal distribution in this way, but, say, between plus and minus 2 standard deviations. (The probabilities associated with this range are 2.275% and 97.725%, using Excel's NORMSDIST() function.)

Figure 2: Fitting Ranked Data to a Normal Distribution

Fitting the ranks to a normal distribution is different. As can be seen in Figure 2, two points in the neighborhood of P( a ), such as data points with ranks 1 and 2, will map further apart than two points in the neighborhood of P( b ), such as data with ranks 455 and 456, because the slope of F( x ) is not constant; it is steeper at b than at a. So, distribution fitting takes differences in the ranks of observations (or in some cases the observations themselves), and imposes a distributional prior as to how much importance gaps in neighboring observations should have. The distributional method determines the importance of outliers.

import sklearn.preprocessing as pp
import numpy as np

data = np.array( [ -.50, -.25, -.22, -.18, 0.0, .10, .20 ] )
data = data.reshape( 7, 1 )

# STANDARDIZE the data by removing the mean and scaling to unit variance
scaler = pp.StandardScaler()
std_data = scaler.fit_transform( data )
print( '\nSTANDARDIZED DATA:\n' )
print( std_data )

# NORMALIZE the data by scaling each data point to a given range
mm_scaler = pp.MinMaxScaler( feature_range = ( -2, 2 ) )
mm_data = mm_scaler.fit_transform( data )
print( '\nMINMAX DATA:\n' )
print( mm_data )

# rescale the ranked data the same two ways
ranks = np.array( [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0 ] )
ranks = ranks.reshape( 7, 1 )

std_ranks = scaler.fit_transform( ranks )
print( '\nSTANDARDIZED RANK DATA:\n' )
print( std_ranks )

mm_ranks = mm_scaler.fit_transform( ranks )
print( '\nMINMAX RANK DATA:\n' )
print( mm_ranks )

3.4.3 Difference Bins

A variation on the ranking theme is to scale the ranks by the difference between data points, so that points with larger differences between them have a correspondingly larger gap between their ranks. This is typically done by placing the differences into bins. The steps are as follows:

Step 1: Sort the data from low to high.
Step 2: Find the differences between successive points.
Step 3: Put the differences into bins 1...m according to their size. That is, small differences go into bin 1, and the largest differences go into bin m.
Step 4: To assign ranks to the data, give the smallest data point a rank of 1, and then add the bin value to each successive rank, so that each value gets assigned a rank that differs from the previous rank by the bin value. Thus, there will be gaps in the numbering.
Step 5: Finally, proceed with distribution fitting as before.
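Here is a minimal numerical sketch of these steps. The data values are made up, and m = 3 equal-width bins is an illustrative choice.

import numpy as np

data = np.array( [ 100., 98., 105., 100., 138., 101., 97. ] )
m = 3

# Step 1: sort the data from low to high
s = np.sort( data )

# Step 2: find the differences between successive points
diffs = np.diff( s )

# Step 3: assign each difference to a bin 1...m by size, using equal-width bins
edges = np.linspace( diffs.min(), diffs.max(), m + 1 )
bins = np.digitize( diffs, edges[ 1:-1 ], right = True ) + 1

# Step 4: the smallest point gets rank 1, and each later rank adds its bin value
ranks = np.concatenate( ( [ 1 ], 1 + np.cumsum( bins ) ) )

print( s )
print( ranks )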
