MSF 503 Chapter 1 - Introduction to Excel (PDF)
Illinois Tech
Ben Van Vliet
Summary
This chapter introduces spreadsheet modeling in Excel, focusing on practical applications and test-driven development. Spreadsheet testing techniques, error prevention, and design best practices are discussed. The chapter emphasizes the importance of clear definitions of algorithms, data requirements, and user interfaces, highlighting the need for effective prototyping before full implementation of models.
CHAPTER 1 INTRODUCTION TO EXCEL

Please be aware that the financial industry will expect that you know everything in this book. You will need to study the material for the closed-book weekly quizzes and the exams. These will not require you to memorize code. (Coding is for the homeworks.) Rather, they will require you to articulate concepts and do math. However, coding leads to understanding.

Finance is the study of how people make investment decisions. These decisions are always forward looking and subject to considerable uncertainty. We only care about the past if it can help us make better decisions about the future.

A model is a representation of reality. We usually think of models as mathematical equations, though not all equations are models. But graphs and pictures, algorithms and code, data tables, and words can also be models. This book combines finance, mathematical models, and Python code. As representations, particularly of social phenomena like finance, we recognize that:

"All models are wrong. Some models are useful." -George Box

A big part of the problem with modeling in finance is that finance is not a natural science:

1. We cannot run controlled experiments to test theories scientifically.
2. We cannot have complete (or any?) faith in statistics in finance because the distributions are not stationary. The means and variances (a measure of uncertainty) change over time.

Nevertheless, models help us understand observed, or empirical, phenomena and potentially help us predict the future. Models also help us communicate ideas about how we think things work. So, models are a kind of vocabulary. Demonstrating your knowledge of finance depends upon communicating, or articulating, clearly how you think things work. If you are unable to articulate your ideas, people will think you don't know what you're talking about. So, you will need to become adept at five methods of articulation. Notice that all these articulate the same idea:

To be sure, getting good at all of these is hard. And, you may have little experience. But don't let that stop you from getting started. Don't worry about what you don't know. Keep trying. Practice, practice, practice. You were smart enough to get into Illinois Tech. You're smart enough to figure it out. And remember, the answer to every question is "I don't know. Google it." Alternatively, the right answer is often, "I don't know. Try it out and see what happens." This is a sophisticated learning technique known as hacking. We are all hackers!

1.1 Getting Started with Excel

An Excel spreadsheet application contains a workbook, which is a collection of worksheets. Spreadsheets are software, which makes you a software developer. Mistakes, or bugs, can be very costly. They can even get you fired. BE CAREFUL!

Spreadsheet errors are pervasive. As Dr. Ray Panko points out, "consistent with research on human error rates in other cognitive tasks, laboratory studies and field examinations of real-world spreadsheets have confirmed that developers make uncorrected errors in 2%-5% of all formulas… Consequently, nearly all large spreadsheets are wrong and in fact have multiple errors… Spreadsheet error rates are very similar to those in traditional programming." [1]

[1] Panko, Raymond R. 2006. "Recommended Practices for Spreadsheet Testing." EuSpRIG 2006 Conference Proceedings, p. 73 ff.

Be sure to test your spreadsheet calculations. Testing should consume 25%-40% of development time. You should be careful when using any Excel function, as the formula used for calculation may be different from what you expect.
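A classic illustration is Excel's NPV() function, which discounts every value passed to it, including the first one, so a cash flow at t = 0 has to be added outside the function. Here is a minimal Python sketch, with hypothetical cash flows, of checking a built-in function against the formula you actually intend:

rate = 0.08
cf0 = -100.0                    # initial investment at t = 0
cfs = [ 30.0, 40.0, 50.0 ]      # cash flows at t = 1, 2, 3

# NPV as defined later in this chapter: CF0 plus the sum of the discounted future cash flows
npv_formula = cf0 + sum( cf / ( 1 + rate ) ** ( t + 1 ) for t, cf in enumerate( cfs ) )

# What an Excel-style NPV() returns if all four cash flows are passed to it:
# the t = 0 cash flow gets discounted one extra period
npv_excel_style = sum( cf / ( 1 + rate ) ** ( t + 1 ) for t, cf in enumerate( [ cf0 ] + cfs ) )

print( npv_formula )        # about 1.76
print( npv_excel_style )    # about 1.63 -- close, but not the same number

The two results differ by exactly one period of discounting, which is the kind of difference that is easy to miss until you check.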
Never assume that a function calculates something the way you think it does. Always verify the formula using the help files.

READ THIS: Spreadsheet Error Horror Stories

1.2 Spreadsheet Quality and Testing

The outcome of a modeling project can be:

a.) A working piece of software deployed to drive business decisions;
b.) A prototype, which is a temporary working implementation to be replaced, or used simply as
c.) A requirements specification for software development in, for example, Python.

In these cases, prototyping in Excel provides for:

1. Clear definitions of algorithms. Building a prototype forces you to clearly define and formulate calculations. If you cannot explain the system well enough to build a prototype, then there is little chance that the developers will be able to build it.
2. Clear definition of data requirements. Prototypes bring into focus what data inputs are needed for implementation and what outputs a working model will produce.
3. Requirements for graphical user interfaces. This forces discussion about how humans will interact with the software. The purpose of user interface (UI) prototyping is to ensure that the end user can change dynamic inputs and states, shut down, turn on/off, and select parameters.
4. A working application for regression testing. You will need your Excel prototype to test your Python implementation.

The primary reason for creating an Excel prototype is to resolve uncertainties early in the modeling process: is it feasible? Can it be built? An Excel prototype is useful for revealing and resolving ambiguity and incompleteness in the requirements. The presence of unclear requirements may create an expectation gap between management's vision and your vision. Prototyping makes requirements tangible by bringing use-cases to life and closes gaps in understanding of requirements. (Financial engineers may be the only ones able to understand the equations and symbols. Prototyping helps explain complex calculations to less quantitative team members.)

Here are some tips toward effective prototyping:

- Include internal or external documentation, consisting of:
  o A brief description of the model and who built it.
  o The version number.
  o Details about the input data used in the model.
  o Summary of model change requests, i.e. a model change control document.
  o Summary of logical assumptions.
- For all members of your team, use the same style conventions, e.g. font, color coding, size, etc.
- Focus initial prototyping efforts on poorly understood or risky areas. That is, prototype the hardest parts first. That way, if you can't get it to work, you don't waste time on the easy stuff.

Remember that an ounce of prevention is worth a pound of cure. Reworking defective models can waste a lot of time. As a rule of thumb, time spent on prevention pays off (at least) 3-to-1 versus time spent on fixing defects.

You can prevent errors by using spreadsheet design best practices. The six golden rules of spreadsheet design are:

1. Separate inputs, calculations and results. Place inputs in a separate part of the worksheet from calculations. Outputs should be in another part still.
2. Use one formula per row or column.
3. Refer to the above and to the left.
4. Use multiple worksheets for ease of expansion and for repeatable blocks.
5. Use each column for the same purpose throughout the model.
6.
Include a documentation sheet that identifies key formulae.

Grossman and Ozluk add the following:

1. Keep formulas simple to enable checking of intermediate calculations.
2. Use named cells and ranges.
3. Enter each input once.
4. Use absolute references.
5. Use cell-protection to prevent accidental modification of formulas.
6. Make the spreadsheet file read-only to prevent accidental overwriting.

Also, we recommend enhancing the look of the spreadsheet with bold fonts, borders, shading, and graphs when displaying results. This will make the spreadsheet easier to read.

If your Excel spreadsheet is deployed, you will want to prevent users from introducing errors. I recommend that at a minimum you employ Tom Grossman's six features:

1. Protect source code from user changes.
2. Prevent user customization.
3. Restrict linking user spreadsheets into the application.
4. Prevent users from sharing the application.
5. Provide easy control of protection features using a well-designed user interface.
6. Prevent user tampering with protection features by employing passwords.

Following these guidelines will at least make it very difficult for users to mess around with the code and introduce errors or untested modifications that you might otherwise get blamed for.

1.2.1 Test-driven Development

Test-driven development (TDD) is a software development methodology that relies on the repetition of, or iteration over, a three-stage cycle:

1. Create test cases that define desired functionalities.
2. Write the minimum amount of code that passes the test.
3. Refactor the code to acceptable standards, if necessary.

That is, before you start your modeling project, you have to be able to answer the question: "How am I going to test this software?"

1.2.2 Spreadsheet Testing Methods

In spreadsheet testing, you try several input values to see if they produce the correct results. How do you know your spreadsheet is correct? This is the oracle problem. An oracle is a way to determine whether a piece of software passes a test. Given a test case (i.e. a set of inputs), we compare the output of our software to the output of the oracle. There are three ways to solve the oracle problem:

1. Find an oracle, a credible example, either in a book or a research paper or a website, that you can replicate in Excel. If your Excel code generates the same results as the oracle, you can be fairly confident your Excel code is correct.
2. If no oracle exists, then create, audit, and test an Excel spreadsheet implementation in which you have a high degree of confidence.
3. If no oracle exists, generate simulated data using known parameters, then develop your model and run it against the data. If the model returns the same parameters known to be in the data, you can be fairly confident your code is correct.

In the next chapter, we will begin to develop Python code. How do you know your Python code is correct? If your Python code generates the same results as your Excel spreadsheet (which generates the same results as your oracle), you can be fairly confident your Python code is correct.

Regression testing is what you do after you make a change to a spreadsheet or Python code to ensure that it works the same as before. You use a suite of input values previously used in the model and compare the results to those from before the software was changed. The previous version of the spreadsheet is the oracle.

In spreadsheet inspection, you examine all formula cells.
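One simple way to automate the Python-versus-Excel comparison described above is to copy a handful of oracle values out of the spreadsheet and assert that the Python implementation reproduces them within a tolerance. A minimal sketch, with hypothetical present-value test cases:

import math

def present_value( cash_flow, rate, time ):
    return cash_flow / ( 1 + rate ) ** time

# expected values copied out of the Excel prototype (the oracle)
oracle_cases = [
    # ( cash_flow, rate, time, expected_value_from_excel )
    ( 100.0, 0.08, 0.25, 98.0944 ),
    ( 100.0, 0.08, 1.00, 92.5926 ),
]

for cf, r, t, expected in oracle_cases:
    result = present_value( cf, r, t )
    # a tolerance is needed because floating-point results rarely match to the last digit
    assert math.isclose( result, expected, abs_tol = 1e-3 ), ( result, expected )

print( "All oracle tests passed." )

Re-running the same suite after every change to the code is exactly the regression testing described above.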
Logic inspection before you convert your model to Python allows errors to be detected while the cost of fixing them is small. If you're working on a team, team inspection can be used, where you check each other's work.

As said, testing, in its various forms, should consume 25%-40% of your model development time. But, realize that all these preventative measures and testing methods can never be assumed to eliminate all errors. That is why I encourage iterative development. The more iterations, the more opportunities for uncovering and removing errors.

1.3 Excel Syntax

There are many, many built-in functions in Excel. If you are unfamiliar with them, click on the fx icon to peruse all the functions. You do not have to memorize anything. Some of the Excel functions are related to financial topics, for example, IRR(), DURATION(), NPV(), PMT(), and ACCRINT(). I'm not really interested in these. Here is a brief list of built-in Excel functions you should familiarize yourself with:

AVERAGE    COVAR     EXP        LOG       NORM.S.DIST   SUM
CORREL     DATE      IF         MAX/MIN   RAND          TRANSPOSE
SQRT       DAYS360   INTERCEPT  MINVERSE  SLOPE         VAR.S
COUPDAYBS  DURATION  LINEST     MMULT     STDEV.S       VLOOKUP

You should look at the descriptions and help files of these functions as well as the many others. You will not be able to memorize all the Excel functions, so the key is to know what kinds of functions are available in Excel and find them quickly.

To use a function, simply type = into a cell, then the function name, then the parameters, or input arguments. For example:

= EXP( 5 )

The return value of the function will appear in the cell. Some functions require an array of data as a parameter. This is accomplished using ranges. Usually we set these parameters by pointing and clicking on a cell, rather than by typing the code. For example, given values in cells A1 through A5, the sum of these values can be calculated in cell B1 as:

= SUM( A1:A5 )

Some functions accept more than one parameter, which are separated by commas. If a cell used as an input argument is in another worksheet, the cell reference is preceded by the sheet name. For example:

= IF( Sheet1!A1 > 0, 1, 0 )

Matrix functions often return arrays. To make this work, highlight the entire range, type in the function, and press Control-Shift-Enter. For example, given a matrix of values in cells A1 to B2, we can put the transposed matrix into cells C1 to D2 by highlighting C1 to D2, entering the formula, and then pressing Control-Shift-Enter:

= TRANSPOSE( A1:B2 )

Other matrix functions include MMULT, MINVERSE, and MDETERM.

1.3.1 Working with Cells

So far we have looked at absolute cell referencing. An absolute reference uses the column letter and row number. There is also relative cell referencing (usually a bad idea), which shows the number of rows and columns up and to the left of the current cell. Oftentimes we need to copy formulas over some range. This is easily done by left-clicking on the square in the lower right-hand corner of the highlighted cell and dragging downward. In this example, cell B1 contains the formula:

= SUM( $A$1:A1 )

After copying this formula down, cell B8 contains:

= SUM( $A$1:A8 )

Notice that the first cell in the range reference is locked using the dollar signs, $A$1. This means that as you copy formulas to adjacent cells, neither the column nor the row in the first reference will change.
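The copied-down SUM( $A$1:A1 ) formula produces a running total of column A. A minimal numpy sketch, with hypothetical values, that reproduces the same column and can be used to check the spreadsheet:

import numpy as np

a = np.array( [ 3.0, 7.0, 2.0, 6.0, 4.0 ] )    # values in A1:A5
running_total = np.cumsum( a )                 # what B1:B5 would contain
print( running_total )                         # [ 3. 10. 12. 18. 22. ]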
Clicking on F4 iterates through the four possible combinations of fixed cell references: A1, no lock; $A1, only column locked; A$1, only row locked; and $A$1, both column and row locked.

Sometimes we like to use named ranges (which I think are usually a bad idea). Rather than using cell references then, we can use the range name as a parameter instead. For example, we can name cell A1 as Rate in the name box. Then, in cell B2 we can use this value as a parameter thusly:

= EXP( Rate )

For example, take present value (PV), which is the value today of a series of cash flows received in the future:

Present Value = CashFlow / ( 1 + Rate )^time

The cash flows (CF) in the future are less valuable than the same amount of cash today, and the discount rate adjusts for the riskiness of the future cash flow. The higher the risk, the higher the discount rate. You can always do your own math in Excel using operators to implement a formula. For example, the present value equation can be implemented as follows:

= 100 / ( 1.08 ) ^ .25

The net present value (NPV) is the present value minus the cost of acquiring the asset, essentially just a negative cash flow at t = 0. But, Excel has an NPV function built in. Be careful! Never assume Excel calculates something the way you think it does.

Net Present Value = CF_0 + Σ [ CF_t / ( 1 + r )^t ]   for t = 1 to n

In this example, the investment is worth taking if and only if the present value of the future cash flows exceeds the initial cost. That is, NPV > 0 is the decision rule.

1.3.2 Working with Data

Oftentimes we use Excel to store data. This is convenient especially in light of Excel's look-up functionality, which searches for a value in a data table based upon a condition. Here is a simple example borrowed from Investopedia [2], with U.S. Treasury data in columns A and B and corporate bond benchmarks in columns D through F:

     A              B      C   D           E           F
1    U.S. Treasury  Rate       Bond        Benchmark   Benchmark Yield
2
3    2 Year         3.04       XYZ Corp.   30 Year     4.63
4    5 Year         3.86       ABC Inc.    2 Year      3.04
5    10 Year        4.14       PDQ & Co.   5 Year      3.86
6    30 Year        4.63       MNO, Ltd.   10 Year     4.14

[2] See "Microsoft Excel Features For The Financially Literate," by Barry Nielsen, CFA.

In this example, cell F3 contains the following formula:

= VLOOKUP( E3, $A$3:$B$6, 2, False )

In the VLOOKUP function call (V stands for vertical; there is also HLOOKUP for horizontal), E3 is the look-up condition. A3:B6 is the table range. 2 indicates to compare the look-up condition to column one in the data table and return the corresponding value from column two. True as the last parameter returns an exact or approximate match; False returns only an exact match. The values in the data table must be in ascending order, or it may not work right.

1.3.3 Working with Dates and Times

If we enter a date into Excel, say 01/09/2011, we can format its appearance in several ways by right-clicking and selecting Format Cells… However, Excel itself keeps track of the date as an integer value. We can use Excel's built-in date functions to perform date calculations. For example, to find the amount of time between two dates:

     A
1    1/9/2011
2    7/18/2011
3    190
4    0.5205

Cell A3 contains the number of days between the two dates using either:

= A2 - A1
= DATEDIF( A1, A2, "d" )

The formula in cell A4 is:

= YEARFRAC( A1, A2, 3 )

For information on the parameters for the YEARFRAC function, see the Excel help files.

1.3.4 Other Functionality

You should also familiarize yourself with other capabilities of the Excel visual development environment.
Most often, there is more than one way to accomplish any particular task. And, many times there are wizards and visual cues to walk you through development. For example, some of the following are also available by highlighting an Excel range and right-clicking.

What              How                              Description
Saving Files      File Menu or Toolbar             Opening and saving Excel files
Copy / Paste      Edit Menu or Toolbar             Editing cells and worksheets
Cell Formatting   Format Menu or Toolbar           Changing cell appearance, also setting decimal style
Excel Options     Tools | Options                  Changing default Excel settings
Charts            Chart Wizard                     Creating charts in Excel using Insert | Charts
Sorting Data      Data | Sort or Icon              Sorting data ascending or descending
Auditing          Formulas | Trace…                Trace cells that are precedents and dependents
VBA Editor        Developer | Visual Basic Editor  Launch the VBA development environment

We will look at more Excel functionalities over the course of this study guide.

1.3.5 Getting Price Data from Yahoo! Finance

Historical stock price data is available for free on Yahoo! Finance. To get data:

- Go to Yahoo! Finance.
- Type in the symbol IBM.
- Click on Historical Prices.
- Select a date range, say Jan 1, 2023 to Dec 31, 2023.
- Check Daily data.
- Click Get Prices.
- At the bottom, click Download to Spreadsheet and save to IBM.csv.

This .csv file will open automatically in Excel.

1.3.6 Histograms in Excel

A histogram is a bar chart that shows the frequency distribution of a data set. That is, the height of each bar in a histogram is determined by the number of data points in the data set that fall into each bin. The total area of the histogram is the number of data points. In Excel, one way to calculate the widths of the bins, particularly for continuous data or discrete data where there are a large number of possible outcomes, is:

bin size = ( max( x ) - min( x ) ) / desired number of bins

Usually, for continuous data or data where there is a large number of possible discrete outcomes, I find that between 20 and 30 bins works out well. Consider the following random data:

     A    B    C    D    E
1    6    7    5    9    8
2    8    8    3    0    3
3    1    7    6    5    3
4    6    2    3    4    3
5    6    8    8    8    1

And the following discrete bins:

     F
1    Bins
2    0
3    1
4    2
5    3
6    4
7    5
8    6
9    7
10   8
11   9

Click on Data | Data Analysis | Histogram. Populate the histogram window with the input range (A1:E5) and the bin range (F2:F11). The output shows the frequency and cumulative percentage distributions.

     G     H          I
1    Bin   Frequency  Cumulative %
2    0     1          4.00%
3    1     2          12.00%
4    2     1          16.00%
5    3     5          36.00%
6    4     1          40.00%
7    5     2          48.00%
8    6     4          64.00%
9    7     2          72.00%
10   8     6          96.00%
11   9     1          100.00%
12   More  0          100.00%

[Histogram chart: Frequency (bars) and Cumulative % (line) by Bin]

Notice in the chart output that the cumulative percentage, F( x ), has a range of 0 to 1. If we divide the frequency of each outcome by the total count of (in this case) n = 25, we get the percentage that each outcome in the sample represents. The sum of these percentages, or what will be probabilities, is equal to one. We end up with a probability distribution. We often try to fit a probability density function, f( x ), to the data in a probability distribution. From there, the cumulative distribution function, F( x ), can be estimated through summation of the relative frequencies.

1.4 Descriptive Statistics

Recall that in a probability distribution, the first raw moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.
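A minimal Python sketch of these four moments, using an arbitrary sample and scipy.stats. Note that scipy's kurtosis() reports excess kurtosis by default (fisher = True), so fisher = False is passed here to get the plain fourth standardized moment:

import numpy as np
from scipy import stats

x = np.array( [ 1.0, 2.0, -2.0, 0.0, -1.0, 2.0 ] )

print( np.mean( x ) )                          # first raw moment
print( np.var( x ) )                           # second central moment (population)
print( stats.skew( x ) )                       # third standardized moment
print( stats.kurtosis( x, fisher = False ) )   # fourth standardized moment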
Consider the following example spreadsheet data.

     A     B
1    1     2
2    2     -1
3    -2    -2
4    0     1
5    -1    0
6    2     2

Measures of Central Tendency:

Mean: The mathematical average, which is the sum of all the values divided by the number of values.
Median: The value that lies in the middle of all values after sorting low to high.
Mode: The most commonly occurring value.

= AVERAGE( A1:A6 )
= MEDIAN( A1:A6 )
= MODE( A1:A6 )

Measures of Dispersion (or Width):

Range: The difference between the max value and the min.
Quantile: Also called Percentile, and often defined as Quartiles (fourths), Quintiles (fifths), Deciles (tenths), etc. The value below which some percentage of the values falls. The Interquartile Range (IQR) is the difference between the third quartile and the first quartile, so that 50% of the values lie in the range.
Variance: The squared deviation of each value from the mean. That is, it is the average of all the squared differences from the mean.
Standard Deviation: The square root of the variance. It is used in finance as a proxy for volatility or risk.

= MAX( A1:A6 ) - MIN( A1:A6 )     ' Find the range
= PERCENTILE( A1:A6, 0.5 )        ' Value at the 50th percentile
= VAR.S( A1:A6 )                  ' Calculate the sample variance
= STDEV.P( A1:A6 )                ' Calculate the population standard deviation

Measures of Shape:

Skew: A measure of the asymmetry in the distribution, either left or right. Usually, if median > mean, then the distribution is negatively skewed. If median < mean, then the distribution is positively skewed. This is not always the case, but usually.
Kurtosis: A measure of the weight of the tails of the distribution relative to the mean. The normal distribution has a kurtosis of 3. We are often interested in the excess kurtosis, relative to that of the standard normal.

= SKEW.P( A1:A6 )
= KURT( A1:A6 )

As a finance example, consider the compound interest equation A = P( 1 + r/n )^( nt ), where n is the number of times the interest compounds per year. As n "goes to infinity", or n → ∞:

lim ( 1 + r/n )^( nt ) = e^( rt )

Then the continuous compound interest formula that we use for stocks is:

S_t = S_0 e^( rt )

Solving for r, the log return (where t usually is 1), gives us:

r = ln( S_t / S_0 )

If we make a histogram of log returns over time, it will look something like this:

[Figure: histogram of log returns]

The reality is that the histograms of log returns r in financial markets are leptokurtic.

Measures of Dependence:

Covariance: A measure of the degree to which two random variables (i.e. returns on two assets) tend to deviate from their own means in tandem.
Correlation: Pearson's correlation is a linear measure of the degree to which two variables tend to move together.

= COVARIANCE.P( A1:A6, B1:B6 )
= CORREL( A1:A6, B1:B6 )

Spearman's Rank Correlation: A non-parametric measure of statistical dependence between two random variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

A simple procedure is normally used to calculate Spearman's correlation. The n raw scores are converted to ranks x_i, y_i, and the differences d_i = x_i − y_i between the ranks of each observation on the two variables are calculated.
If there are no tied ranks, then ρ is given by:

ρ = 1 − ( 6 Σ d_i² ) / ( n ( n² − 1 ) )

= RANK.AVG( A1, $A$1:$A$6, 0 )    ' C1 to C6
= RANK.AVG( B1, $B$1:$B$6, 0 )    ' D1 to D6
= C1 - D1                         ' E1 to E6
= E1 ^ 2                          ' F1 to F6

     A     B     C     D     E      F
1    1     2     3     1.5   1.5    2.25
2    2     -1    1.5   5     -3.5   12.25
3    -2    -2    6     6     0      0
4    0     1     4     3     1      1
5    -1    0     5     4     1      1
6    2     2     1.5   1.5   0      0
7                             Sum:  16.5

In the data in both columns A and B, 2 is repeated. In columns C and D, the RANK.AVG function assigns the average rank of 1.5 for both. Then, the Spearman's rank correlation is:

ρ = 1 − ( 6 × 16.5 ) / ( 6 ( 36 − 1 ) ) = 0.5286

LAB 1.1: Descriptive Statistics of Stock Returns

Get price data for Microsoft (MSFT) and Intel (INTC), 1/1/2015 through 12/31/2023, from Yahoo! Finance. Calculate the daily log returns for each stock, where the log return for day t is the natural log of the closing stock price S on day t divided by the closing stock price S the day before, t − 1:

r_t = ln( S_t / S_{t-1} )

- Calculate the average daily return for each.
- Calculate the total return for each stock (hint: log returns are additive).
- Calculate the daily variances and standard deviations of returns.
- Calculate the annualized volatility of each stock (hint: variance grows linearly with the number of days; standard deviation grows with the square root of the number of days).
- Calculate the covariance and correlation of returns between the two stocks.
- Create a line chart showing the prices of each stock.
- Create a histogram of the returns.

How do you know your answers are right?

CHAPTER 2 INTRODUCTION TO PYTHON

2.1 Getting Started with Python

For this class we will be using:

- Anaconda Navigator 3.2
- Spyder Editor 5.5.1
- Python 3.12

You can install the Anaconda Individual Edition for free here. There are three ways we can write Python code:

1. In the Console 1/A window, we can write command line code.
2. In the editor window (temp.py or untitled.py), we can write programs.
3. In the editor window, we can also write scripts.

In this class, we will be writing scripts that encapsulate financial models. Before we run a script, it's good practice to go to the menu bar, click Run | Configuration per file…, and then check "Remove all variables before execution". Save the file, and then run the program by clicking the Run button. The output of the script will show in the Console 1/A window.

So, now let's take a look at writing Python code.
In this class, we won't worry about these.

You can find The Python Tutorial here: docs.python.org/3/tutorial/

However, there are many, many other websites that have Python code examples and explanations. By and large, don't waste your time reading about code. You learn to code by coding. If your fingers aren't moving, you're doing something wrong. Find code examples. Type them in. Get them to work. Repeat.

There are thousands of open-source, third-party packages for Python. Some of these packages have types and functionality important for financial modeling. It is important to note that there are many overlaps between packages, and therefore, many ways to accomplish the same task. Of course, we will not be able to look at all the ways to do things. We will stick with the well-known packages. There are many, many packages that are wrong, or broken, or out of date. Some important packages that are included in Anaconda that we will be using are: numpy, for its array object; scipy, which is a collection of mathematical algorithms; matplotlib, which is used to produce graphs and figures; and pandas, which enables data analysis in Python. At times we may need to install new packages.

REMEMBER: GOOGLE IS YOUR FRIEND
We will also be using other types, packages, modules, and functions. You should Google them to investigate!

PYTHON IS A GENERAL PURPOSE LANGUAGE
It is relatively easy to write. It is very hard to read! Simplicity and clarity are virtues!

REMEMBER: CLICK THIS "ON"
Run | Configuration per file… | Remove all variables before execution

2.2 Memory Management

Python uses objects, which include simple variables (integers, strings, Booleans), instances of classes, and data structures (lists, numpy arrays, dictionaries). Be aware that Python is:

- Dynamically-typed, which means that every object name is bound at run-time to a type, unless it's null. Types are not checked at design time or by a compiler.
- Strongly-typed, which means that objects cannot be coerced into unrelated types implicitly, say from "9" (string) to 9 (int). You need an explicit conversion to do that.
- Implicitly-typed, which means it uses type inferencing to determine types rather than explicit declarations. This will drive you crazy.

Consider the simple line of code a = 7 in Python. Here the thing "a" itself is not an object. Rather, it's a reference (i.e. a pointer, if you like C/C++). All Python references are generic. They are all the same. What distinguishes one reference from another is the type of the thing it refers (or points) to, or (we say) is bound to at any particular time. A reference holds the address of a dynamically allocated object of some type. And, the thing it refers to can change.

a = 7
print( type( a ) )

a = "Hello"
print( type( a ) )

b = a
a = False
print( a )
print( b )

In this case, the reference "a" refers to (or is bound to) 7, which is of type integer. Then, a = "Hello", which is a string. Then, a = False, a Boolean.
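A minimal sketch that makes the rebinding visible: id() returns an identifier for the object a name is currently bound to, so it changes every time "a" is rebound.

a = 7
print( id( a ), type( a ) )

a = "Hello"
print( id( a ), type( a ) )      # a different object, and a different type

b = a                            # b now refers to the same string object as a
print( id( b ) == id( a ) )      # True

a = False                        # rebinding a does not affect b
print( b )                       # still prints Hello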
Nevertheless, when a reference is bound to a type, for ease of talking about things, we will simply say "a is an integer," when really what we mean is "a refers to an integer." In Python, the most commonly used data types are:

Example Python Code                       Inferred Data Type
x = "Hello World"                         string
x = 20                                    int
x = 20.5                                  float
x = 1j                                    complex
x = [ "apple", "banana", "cherry" ]       list
x = ( "apple", "banana", "cherry" )       tuple
x = range( 6 )                            range
x = { "name" : "John", "age" : 36 }       dictionary
x = { "apple", "banana", "cherry" }       set
x = True                                  bool

When an object is no longer being referred to, it will be deleted, or deallocated. This happens through a process called garbage collection, which works through reference counts. Whenever you create an object in Python, it has a reference count. An object's reference count gets incremented whenever it is referenced, and the count gets decremented when it is dereferenced. If the reference count gets to 0, the object is deallocated.

2.1.1 Simple Syntax Example

There are millions of web pages that give examples in Python. You should get used to the ones you like.

# import the math package
import math
# math package contains many functions such as
# pow, exp, log (for natural log), sqrt, pi, e

# define a function
def add( x, y = 5 ):
    # x and y are the parameters, and y has a default value
    # the body of the function is determined by indents
    return x + y

# create a variable and use the math package
x = math.pow( 2 + 4, 2 )    # define a variable. x is implicitly typed.

# print x
print( x )

# call the function and print the return value
print( add( 3 ) )

2.1.2 Control Structures Code Example

Let's take a look at the if…else, for loop, and while loop syntaxes in Python. Add the following code to that above.

a = 5
b = 2

# if...else statement syntax
print( "\nIF...ELSE:\n" )    # \n is the line return character

if a >= 5:
    print( "\tTRUE!" )
else:
    print( "\tFALSE!" )

# for loop syntax
print( "\nFOR LOOP:\n" )

for i in range( 1, 10 ):
    print( i )

# while loop syntax
print( "\nWHILE LOOP:\n" )

while a < 10:
    print( "\t" + str( a ) )    # \t is the tab character
    a, b = b, a + b

2.2 Python Data Structures

In this class we will look at the relevant in-memory data structures in Python and the ways we will be downloading data, reading data files, and accessing various data sources. In order to manipulate data in Python you must understand how data is represented in memory. Under the hood of Python is C/C++. In C++ an array is a set of contiguous memory locations. Assume that we have an array of integers { 7, 2, 6, 8, 1, 3 }, where the size of an integer in memory is 4 bytes. Each element in an array has three things: an address, an index, and a value.

Notice that a 1D array has elements addressed 4 bytes apart. If we reshape this array in 2D with 2 rows and 3 columns, notice that the first dimension { 10, 22 } holds the address of the first element in each row. However, the underlying data has not changed addresses. It's really just a matter of perception. Likewise, if we reshape this array in 2D with 3 rows and 2 columns, where the first dimension is now { 10, 18, 26 }, the data remains the same. Were we to ravel this 2D array, we would see that the flattened array shows that the data is still in 1D. Try this simple script out that will show how this works using a numpy array.
import numpy as np

a = np.array( [ 7, 2, 6, 8, 1, 3 ] )
print( a )
print( a.shape )

# To use this array as a column vector
a = a.reshape( 6, 1 )
print( a )
print( a.shape )

# To use this array as a row vector
a = a.reshape( 1, 6 )
print( a )
print( a.shape )

# We can reshape this into 2D
a = a.reshape( 2, 3 )
print( a )
print( a.shape )

# Or, alternatively in 2D
a = a.reshape( 3, 2 )
print( a )
print( a.shape )

# The raw data has not changed
print( a.ravel() )

There are five data structures we will be using most often: lists, dictionaries, tuples, numpy arrays, and pandas dataframes. (While Python has a standard array type, we will not be using it. We always use numpy arrays.) Here is a very brief review of these five.

2.2.1 Lists [ ]

Lists can contain elements of different types, though very often we use them to hold arrays of data. The notation is a little unsettling at times. Try these out.

a = [ 3, 7, 2, 6, 4 ]
print( a[ :2 ] )
print( a[ 3: ] )
print( a[ 2:4 ] )

b = [ [ 4, 9, 2 ], [ 6, 5, 1 ], [ 8, 3, 7 ] ]
print( b[ 1 ] )
print( b[ :2 ] )
print( b[ 1: ] )

2.2.2 Numpy Arrays

Numpy's main object is the homogeneous, multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In numpy, dimensions are called axes. For example, the coordinates of a point in 3D space, [1, 2, 1], form a 1D array. It has one axis. That axis has 3 elements in it, so we say it has a length of 3. A 2-axis array has a first axis (rows) of length 2, and a second axis of length 3:

[[1., 0., 0.],
 [0., 1., 2.]]

Numpy also provides familiar math functions such as sin, cos, and exp. Numpy arrays support vectorized operations, such as element-by-element addition, so we don't have to write for loops. In other words, numpy arrays are fast. We can define the data type for an array, but it can also be inferred, always as the minimum type required. It only takes one floating point number to require a float type array.

import numpy as np

# b is an integer array from a list
b = np.array( [ 2, 4, 6 ] )

# Multiply b by 2
b *= 2

print( b )
print( b[ 1 ] )
print( b[ :-1 ] )
print( b[ :2 ] )
print( b[ 1: ] )
print( b )
print( sum( b ) )
print( b[ 1: ] - b[ :-1 ] )

Now, append the following.

# c is a float array from a list
c = np.array( [ 3, 5.3, 9 ] )

d = b + c
print( type( d ) )
print( d )

How about 2D arrays? Append the following.

# e is an array of 51 floats 0-10
e = np.linspace( 0, 10, 51 )
print( e )

# f is an array of 5 zeros
f = np.zeros( 5 )
print( f )

# g is a float array, the element-by-element product of b and c
g = b * c
print( g )

Now try this.

# h is a 2D integer array from a 2D list
h = np.array( [ [ 1, 4, 5, 8 ], [ 9, 7, 4, 3 ] ] )

# i is a 2D array of ones as floats
i = np.ones( ( 3, 4 ), dtype = float )
print( i )

# j is a 4x4 identity matrix
j = np.eye( 4 )
print( j )

# k is h transposed
k = h.T
print( k )

What about matrix multiplication?

# WARNING: Do not use * to do matrix multiplication
# Use the dot() function
l = np.dot( h, j )
print( l )
print( l.shape )

Here are some helpful techniques.

# sum the squares of the elements in an array using a list comprehension
m = sum( [ x ** 2 for x in b ] )
print( m )

# apply the natural log function to the elements in an array
n = np.log( b )
print( n )

2.2.3 Dictionaries { }

A dictionary is an unordered collection of objects (values), "indexed" by unique identifiers (keys), which can be strings or numbers. We say that each element in a dictionary is a key-and-value pair.
data = { "IBM":[ 100, 101, 97, 99 ], "MSFT":[ 55, 59, 54, 56 ] } print( type( data ) ) print( data.keys() ) print( data.values() ) print( data[ "IBM" ] ) 2.2.4 Tuples A tuple consists of immutable values of different types separated by commas. However, a tuple may contain mutable types. t = 'IBM', 145.34, 14500 print( t[ 1 ] ) v = ( [ 1, 2, 3 ], [ 4, 5, 6 ] ) v[ 1 ][ 0 ] = 'X' print( v ) 2.2.5 Pandas Dataframe A Pandas dataframe is a two-dimensional, size-mutable table (i.e. rows and columns, or records and fields), where the columns can be of different data types. Let’s create a dataframe from the ground up. Suppose we want to manage time series stock © 2024 Ben Van Vliet 43 price data—symbol, date, closing price, and volume—in memory. (Don’t worry about the actual data for now.) Notice the various data types listed in the output of this program. import pandas as pd import numpy as np # Create some lists bardata = [ 1, 2, 3, 4 ] symbol = [ 'IBM' ] * 4 close = [ 100., 101., 99., 100 ] volume = [ 100 ] * 2 + [ 200 ] * 2 date = pd.date_range( '20170103', periods = 4 ) # Create a dictionary bardata = { 'Symbol':symbol, 'Date':date, 'Close':close, 'Volume':volume } # Convert the array to a dataframe df = pd.DataFrame( bardata ) # This function returns the first n rows for the object based on position print( '\nDATAFRAME:\n' ) print( df ) print( '\nTYPES IN THE DATAFRAME:\n' ) print( df.dtypes ) # Get a column from a dataframe vs = df[ 'Close' ] print( '\nCLOSE COLUMN:\n' ) print( vs ) # Add a new column to a dataframe with the log returns df[ 'Return' ] = np.insert( np.diff( np.log( df[ 'Close' ].tolist( ) ) ), 0, [ 0 ] ) print( '\nDATAFRAME WITH NEW COLUMN:\n' ) © 2024 Ben Van Vliet 44 print( df ) 2.3 Python Data Sources You should know how to access data in Python from: Comma Separated Value (.csv) file Excel (.xlsx) file Numpy (.npy) file Pickle (.pkl) file Pandas DataReader 2.3.1 Reading.csv and.xlsx Files You can read.csv datafiles into dataframes easily. Download some historical data from Yahoo Finance and read it into a dataframe. import pandas as pd # Read a.csv file into a dataframe df = pd.read_csv( 'C:\\Python\\IBM.csv', sep = ',' ) print( df.head( 5 ) ) # Write a.csv file from a dataframe df.to_csv( 'C:\\Python\\IBM_backup.csv', sep = ',', index = False ) There are other ways to read.csv files. However, downloading data into a dataframe gives us the simplest method. However, sometimes there may be too much © 2024 Ben Van Vliet 45 data to load into memory. What if the data is a million rows? In this case we forward read the data, one row at a time. import csv with open( 'C:\\Python\\IBM.csv' ) as csvDataFile: csvReader = csv.reader( csvDataFile ) for row in csvReader: print( row ) We can also read and write to Excel files. # Reading an Excel file to a dataframe. # there is only one worksheet in the Excel file import pandas as pd df = pd.read_excel( 'C:\\Python\\IBM.xlsx', sheet_name = 'Sheet1' ) print( df.head( 5 ) ) © 2024 Ben Van Vliet 46 LAB 2.2: Descriptive Statistics in Python Using the returns for MSFT and Intel INTC from LAB 1.1, in Python: Calculate the daily returns for each stock using the continuous compounding formula. Calculate the average daily return for each. Calculate the daily variances and standard deviations of returns. Calculate the covariance and correlation of returns between the two stocks. How do you know your answers are right? 
import numpy as np
import pandas as pd

# read the MSFT data into a dataframe
df1 = pd.read_csv( "C:\\Python\\MSFT.csv" )

# add a column to the dataframe and calculate the log returns
df1[ 'log_ret' ] = np.log( df1[ 'Adj Close' ] / df1[ 'Adj Close' ].shift( 1 ) )

# print the descriptive statistics
# numerical values may need to be converted to strings for printing using str()
print( "\nMSFT Statistics:\n" )
print( "MEAN:\t" + str( df1[ 'log_ret' ].mean() ) )
print( "VAR:\t" + str( df1[ 'log_ret' ].var( ddof = 0 ) ) )
print( "STDEV:\t" + str( df1[ 'log_ret' ].std( ddof = 0 ) ) )
print( "SKEW:\t" + str( df1[ 'log_ret' ].skew() ) )
print( "KURT:\t" + str( df1[ 'log_ret' ].kurt() ) )

rets1 = np.asarray( df1[ 'log_ret' ].dropna() )

# read the INTC data into another dataframe and calculate the log returns
df2 = pd.read_csv( "C:\\Python\\INTC.csv" )
df2[ 'log_ret' ] = np.log( df2[ 'Adj Close' ] / df2[ 'Adj Close' ].shift( 1 ) )

# convert the dataframe column to a numpy array
rets2 = np.asarray( df2[ 'log_ret' ].dropna() )

# combine the two arrays of returns into a single, 2D array
rets_all = np.stack( ( rets1, rets2 ), axis = 0 )

# print out the covar and correlation matrices
print( "\nCOVAR Matrix:\n" )
print( np.cov( rets_all, ddof = 0 ) )
print( "\nCORREL Matrix:\n" )
print( np.corrcoef( rets_all ) )

LAB 2.4: Histograms in Python

Consider again the same data from the histogram example in Section 1.3.6.

import numpy as np
import matplotlib.pyplot as plt

data = np.array( [ 6, 7, 5, 9, 8, 8, 8, 3, 0, 3, 1, 7, 6, 5, 3, 6, 2, 3, 4, 3, 6, 8, 8, 8, 1 ] )

# plot the histogram of the probability distribution
plt.hist( data, density = True, bins = 10 )
plt.hist( data, density = True, bins = 10, histtype = 'step', cumulative = True )
plt.show()

2.3.2 Writing and Reading Numpy Files

Numpy includes a simple file format for numpy arrays. Numpy files (.npy) store the data, shape, dtype, and other information required to reconstruct the array.

import numpy as np

a = np.array( [ 1, 2, 3, 4, 5 ] )
np.save( 'C:\\Python\\np_test', a )

b = np.load( 'C:\\Python\\np_test.npy' )
print( b )

2.3.3 Working with Pickle Files

Pickling is the Python process that serializes objects or data structures into byte streams which can be saved as files or sent over a network. Unpickling is the reverse, where a byte stream is deserialized into an object or data structure. There are three pickling protocols: 0) the ASCII protocol; 1) the old binary format; and 2) the format which provides the most efficient way to pickle. HIGHEST_PROTOCOL always ensures use of the highest available version. When we create pickle files, we will use the .pkl extension.

import pickle
from datetime import date
import numpy as np

data = { 'IBM': [ date( 2018, 1, 26 ), 1.45, 'AA-' ],
         'MSFT': [ date( 2018, 6, 14 ), 0.98, 'AA' ],
         'GM': [ np.NaN, 0.0, 'BB+' ] }

with open( 'C:\\Python\\data.pkl', 'wb' ) as outfile:
    pickle.dump( data, outfile, pickle.HIGHEST_PROTOCOL )

with open( 'C:\\Python\\data.pkl', 'rb' ) as infile:
    data = pickle.load( infile )

print( data )

2.3.4 Working with the Pandas Datareader

The Pandas datareader can automatically populate a dataframe from various internet data sources. The list of data sources can be found here: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

Be aware, however, that not all the data sources actually work for free. Some may require additional licensing or subscription fees to use them.
For example, Yahoo's API no longer permits free downloads. This example accesses the quandl data source for historical price data.

import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data as dp
import datetime

start1 = datetime.datetime( 2017, 1, 1 )
end1 = datetime.datetime( 2020, 1, 27 )

df = dp.DataReader( 'IBM', 'quandl', start = start1, end = end1 )
print( df )

This example accesses the St. Louis FED (FRED) for historical housing data.

import pandas_datareader.data as data
import datetime
import numpy as np
import matplotlib.pyplot as plt

start = datetime.datetime( 2010, 1, 1 )
end = datetime.datetime( 2020, 1, 27 )

df = data.DataReader( 'HOUSTNSA', 'fred', start, end )
print( df )

arr = np.array( df )
plt.plot( arr )
plt.show()

2.6.5 Installing a New Package

Step 1: In your Windows Start menu, open the Anaconda Prompt. Typically, we always want to use the conda install tool to install new packages. The conda install string for the linearmodels package is:

conda install -c conda-forge linearmodels

However, this didn't work for me, so I used pip install instead.

Step 2: At the command prompt, type the pip install string:

pip install linearmodels

Step 3: Check to see if the linearmodels package installed correctly by importing it in a Spyder Python program.

CHAPTER 3 FINANCIAL DATA

"The data is always the problem."

3.1 Types of Databases in Finance

We won't look at all of them in this text, but there are at least six types of financial data we use in finance:

- Price Data consists of the bid and ask prices and quantities, and trade prices and quantities, for securities and derivatives.
- Valuation Data is different from price data. For some financial instruments (bonds, swaps, and all OTC derivatives) no price data exists, or if it does, it is a highly guarded secret. For these, valuation data is all there is. That is, the price exists only in theory, and, furthermore, is not a firm bid or offer that any market maker is obliged to honor.
- Fundamental Data consists of everything that is disclosed in 10-Q quarterly and 10-K annual reports, including key business items, such as earnings, sales, inventories, and rents.
- Calculated Data is data that is calculated from fundamental data, such as ROE, price to book, beta, forecasted dividends, free cash flow, etc. Index values, such as the Dow Jones and S&P 500, are calculated and broadcast in real-time.
- Economic Data, such as CPI and GDP, are key indicators often used in financial analysis and trading.
- Unstructured Data consists of things like news articles, pictures, Twitter feeds, and product reviews. Sentiment Data quantifies the emotional content embedded in unstructured data.

3.2 Data Dictionary

As we have seen, we will be accessing data in different file formats. Regardless of the format, however, the data almost always comes in tables consisting of rows (or records) and columns (or fields). It is very important to create a data dictionary for every data set. A data dictionary contains the field names in a data set, along with their data types and descriptions. Here is an example of a data dictionary for the above data.

Field Name      Data Type   Description                                      Valid Values
symbol          string      The symbol of the financial instrument.
industry        string      The industry abbreviation.                       OIL, CHEM, IT, TRANS
quantity        int         The number of shares owned.
entry_price     float       The price the stock was bought at.
profit_dollar   float       The profit or loss since the stock was bought.
winlose         char        Whether the trade is a winner or a loser.        W, L

3.3 Data Cleaning

Good inputs are the key to success in financial modeling, and forecasting, as well as risk management, requires good, clean data for successful testing and simulation. Virtually no data, however, is perfect, and financial engineers spend large amounts of time cleaning errors and resolving issues in data sets. It is very easy and very common to underestimate the amount of time preprocessing will take.

First, identify and categorize all the types of problems you expect to encounter in your data; then survey the available techniques to address those different types of errors; and finally develop methods to identify and resolve the problems. Data problems fall into one of four categories:

- Bad, or incorrect, data.
- Formatting problems.
- Outliers, which skew results.
- Point-in-time data problems.

Types of Bad Data            Example
Bad values                   Tick of 23.54, should be 83.54
Missing data                 Blank field or data coded as "9999," "NA," or "0"
Bad dates                    2/14/12997
Column-shifted data          Value printed in an adjacent column
File corruption              Network errors
Different data formats       Data from different vendors may come in different formats or table schemas
Missing fundamental data     The company may have changed the release cycle

Here are some things to look for algorithmically when attempting to clean a data set.

Scanning for Bad Data
- Intra-period high less than closing price
- Intra-period low greater than opening price
- Volume less than zero
- Bars with wide high/low ranges relative to some previous time period
- Closing deviance: divide the absolute value of the difference between each closing price and the previous closing price by the average of the preceding 20 absolute values
- Data falling on weekends or holidays
- Data with out-of-order dates or duplicate bars
- Price or volume greater than four standard deviations from the rolling mean

3.3.1 Identify Required Cleaning Activities and Algorithms
All data, both real-time and historical, should be assumed to contain errors and issues. For example, data issues for high frequency systems will center more on clean tick data, whereas those for systems with longer-term holding periods will focus more on, say, dividends and releases of and revisions to financial statements.

3.3.2 Bad Data and Outliers

In all cases, cleaning of bad data is a process that consists first of detection, then classification of the root cause of the error, and then correction of the error. Whatever method you use to accomplish these must be shown to operate on both historical and real-time data. This matters because data cleaning algorithms can add latency to real-time systems. Algorithms that cannot be performed in real time prior to trade selection should not be used on historical data.

Since many models are sensitive to even a few bad data points, I recommend looking carefully at the means, medians, standard deviations, histograms, and minimum and maximum values of time series data. A good way to do this is to sort or graph the data to highlight values outside an expected range, which may be good (but outlying) or bad data. Outliers may be good, or real, data points, but they may obliterate the underlying structure, or information, otherwise present in the data.

Brownlees and Gallo (2005) propose the following: construct a z-score heuristic and then eliminate (or "trim") observations if the absolute value of the score is beyond some threshold. The z-score is computed using a moving average, a moving variance, and a parameter g, which induces a positive lower bound on the heuristic variance and is useful for handling sequences of identical prices.

But, why not winsorize the outliers, rather than trim them? For example, set all outliers to a specified percentile of the data. Say, any data point above the 95th percentile would be "pulled in," or set to the value that is the 95th percentile.

LAB 3.1: Winsorizing in Practice

Consider the following data set:

100, 98, 105, 100, 138, 98, 101, 97, 104, 95, 54, 101, 104, 99, 100, 109, 105, 100, 95, 99

The boss is concerned that outliers in the data may ruin a forecasting model. Write a Python script that will winsorize the outliers to the 5th / 95th percentile. As discussed, do it in Excel first. Graphing the data shows us that there are outliers. Now, write the Python code so that you get the same result as you did in Excel.

import numpy as np
from scipy.stats import norm

data = np.array( [ 100, 98, 105, 100, 138, 98, 101, 97, 104, 95,
                   54, 101, 104, 99, 100, 109, 105, 100, 95, 99 ] )

avg = data.mean()
stdev = data.std()

pct_5 = norm.ppf( .05, loc = avg, scale = stdev )
pct_95 = norm.ppf( .95, loc = avg, scale = stdev )

pct_5 = round( pct_5 )
pct_95 = round( pct_95 )

for x in np.nditer( data, op_flags = [ 'readwrite' ] ):
    if x < pct_5:
        x[...] = pct_5
    if x > pct_95:
        x[...] = pct_95

print( data )

To be perfectly honest, I had to use Google four times in order to get this to work right.

Cleaning historical data is a trade-off between under- and over-cleaning. What is bad data to you may not be bad data to me. What is clean at 10-minute intervals may be dirty at 10-second intervals. There is no single, correct way to clean data that applies to all scenarios. Models built on over-cleaned historical data may have problems with real-time data that contains errors. The objective is to manage the trade-off to produce a historical data series where problems are dealt with, but where the real-time properties of the data are maintained.

3.3.3 The Point-in-Time Data Problem

Even "cleaned" data can have problems. Consider the following scenario: stock price data for a trading day is cleaned after the close of business. The next day, your data vendor sends out an adjustment file reflecting the corrected, or cleaned, price data. This occurs regularly due to block trades and other off-exchange trades being reported late, all of which become part of the day's historical price data. However, you didn't have the cleaned data while you were trading that day. If you use the cleaned data for a model, that's not the data you would have had at the time. Further, with the new, cleaned price data, many calculations for derivatives prices from the previous day, which were based on the uncleaned data, will now be wrong. Implied volatility calculations will be wrong. All the historical prices of OTC derivatives that are based on the stock's implied volatility will now be wrong. End-of-day rebalancing algorithms could also be wrong due to the uncleaned data used to run the models. So, cleaned data may not be better.

An algorithm for calculating end-of-day prices is very important, but not as easy as it sounds. At the close, block orders affect prices, traders sometimes push bids and asks around, and small orders can move the market in the absence of late liquidity. Market fragmentation (there are at least 60 execution venues for stocks in the U.S.)
makes it difficult to define a single closing price for a stock. To calculate a clean closing price for a stock, it may be advisable to get a non-official closing price five minutes prior to the close across selected exchanges. Throw out all locked or crossed bids and asks, and then determine the weighted mid-point price. Futures prices are generally fairly clean since contracts trade only on one exchange. Option prices are very difficult to deal with, however, due to market fragmentation, liquidity issues, and issues with the prices of the underlyings.

3.3.4 Synchronizing Data

Databases of the different types listed above have different updating periods. Take splits, for example. Price data vendors update daily. Balance sheet data vendors may update their databases weekly. As a result, a given ratio, such as sales-to-price, may contain an unsplit sales figure and a split price. Fixing this problem is called synchronizing the data, accomplished by either buying synchronized data from a vendor or performing the task in-house.

The real key to synchronizing data, or blending data, is a data map, sometimes called a Rosetta Stone. A Rosetta Stone is the set of unique identifiers used to link data and instruments across databases from various vendors. A proper Rosetta Stone allows the trading or investment firm to trade many instruments (stock, options, bonds, CDs, and OTC products) on a single underlying instrument. Further, developing unique identifiers across underlyings and across vendors enables the blending of proprietary data with purchased data.

3.4 Rescaling Data

3.4.1 Normalizing

The process of normalization rescales a dataset over the range [ 0, 1 ]:

x_i,norm = ( x_i - x_min ) / ( x_max - x_min )

While this may be useful in some cases, outliers will be lost. We use normalization when we want to put things in probability space, since probabilities range from 0 to 1. We can also normalize over some other range. Given raw data in column A, we can convert it to the new normalized score in column C. Start by sorting the raw data. Given data x_1 through x_n, where i = 1, ..., n, the cumulative probabilities in column B are found as:

z_i = min + ( ( x_i - x_min ) / ( x_max - x_min ) ) * ( max - min )

In this next table, given the ranks of the raw data in column A, we can convert it to the new standardized score in column C. The Excel formulae for generating the data are the same as in Table 1. The difference between simple ranking and distribution fitting is that using ranks is like fitting to a uniform distribution.

Figure 1: Simple Ranking Fits to a Uniform Distribution

As can be seen from Figure 1, two ranks in the neighborhood of P( a ) will map the appropriate distance apart, as will two points in the neighborhood of P( b ), because of the constant slope of F( x ) in a uniform distribution.

3.4.2 Standardizing Data

More often, what we want to do is standardization, which rescales the data set to express the values as z-scores, their number of standard deviations from the mean. Thus, we have to remove the mean from the training data (i.e. center it) and scale it to unit variance:

z_i = ( x_i - μ ) / σ

Suppose that, given raw fundamental data (e.g. earnings per share, price-to-book ratio, etc.), we wish to standardize the data by fitting it to the normal distribution in this way, but say between plus and minus 2 standard deviations. (The probabilities associated with this range are 2.275% and 97.725%, using Excel's NORMSDIST() function.)
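A minimal Python sketch of that fit, using hypothetical ranks: the ranks are spread evenly over the cumulative probabilities between 2.275% and 97.725% and then pushed through the inverse normal CDF, so the resulting scores fall between roughly -2 and +2 standard deviations.

import numpy as np
from scipy.stats import norm

ranks = np.arange( 1, 8 )                 # ranks 1..7
p_lo, p_hi = 0.02275, 0.97725             # P( -2 ) and P( +2 )

probs = p_lo + ( ranks - ranks.min() ) / ( ranks.max() - ranks.min() ) * ( p_hi - p_lo )
z = norm.ppf( probs )

print( np.round( z, 3 ) )    # approximately [ -2. -0.908 -0.41 0. 0.41 0.908 2. ]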
Figure 2: Fitting Ranked Data to a Normal Distribution

Fitting the ranks to a normal distribution is different. As can be seen in Figure 2, two points in the neighborhood of P( a )—such as data points with ranks 1 and 2—will map further apart than will two points in the neighborhood of P( b )—such as data with ranks 455 and 456—because the slope of F( x ) is not constant; it is steeper at b than at a. So, distribution fitting takes differences in the ranks of observations (or, in some cases, the observations themselves) and imposes a distributional prior on how much importance gaps between neighboring observations should have. The distributional method determines the importance of outliers.

import sklearn.preprocessing as pp
import numpy as np

data = np.array( [ -.50, -.25, -.22, -.18, 0.0, .10, .20 ] )
data = data.reshape( 7, 1 )

# STANDARDIZE the data by removing the mean and scaling to unit variance
scaler = pp.StandardScaler()
std_data = scaler.fit_transform( data )
print( '\nSTANDARDIZED DATA:\n' )
print( std_data )

# NORMALIZE the data by scaling each data point to a given range
mm_scaler = pp.MinMaxScaler( feature_range = ( -2, 2 ) )
mm_data = mm_scaler.fit_transform( data )
print( '\nMINMAX DATA:\n' )
print( mm_data )

# rescale the ranked data the same two ways
ranks = np.array( [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0 ] )
ranks = ranks.reshape( 7, 1 )

std_ranks = scaler.fit_transform( ranks )
print( '\nSTANDARDIZED RANK DATA:\n' )
print( std_ranks )

mm_ranks = mm_scaler.fit_transform( ranks )
print( '\nMINMAX RANK DATA:\n' )
print( mm_ranks )

3.4.3 Difference Bins

A variation on the ranking theme is to scale the ranks by the difference between data points, so that points with larger differences between them have a correspondingly large gap between their ranks. This is typically done by placing the differences into bins. The steps are as follows:

Step 1: Sort the data from low to high.
Step 2: Find the differences between points.
Step 3: Put the differences into bins 1…m according to their size. That is, small differences go into bin 1, and the largest differences go into bin m.
Step 4: To assign ranks to the data, give the smallest data point a rank of 1, and then add the bin value to each successive rank, so that each value gets assigned a rank that differs from the previous rank by the bin value. Thus, there will be gaps in the numbering.
Step 5: Finally, proceed with distribution fitting as before.

Here is a numerical example using 3 difference bins to illustrate this technique.

     A           B            C                 D
1    Raw Data    Difference   Difference Bin    Raw Rank
2    -1                                         1
3    -.5         .5           2                 3
4    -.3         .2           1                 4
5    -.2         .1           1                 5
6     0          .2           1                 6
7     .1         .1           1                 7
8     .5         .4           2                 9
9     2          1.5          3                 12

In this table, given the sorted raw data in column A, we can convert it to the new raw ranks in column D. These raw ranks can be used as inputs into the previous example to generate a normalized score.
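Here is a minimal sketch of the difference-bin ranking in Python. The equal-width binning rule (via np.digitize) is an assumption for illustration; the chapter's example uses its own bin boundaries, so the exact ranks can differ depending on how the differences are binned.

import numpy as np

def difference_bin_ranks( data, m = 3 ):
    x = np.sort( np.asarray( data, dtype = float ) )    # Step 1: sort low to high
    diffs = np.diff( x )                                 # Step 2: differences between points

    # Step 3: place the differences into bins 1..m (equal-width bins assumed here)
    edges = np.linspace( diffs.min(), diffs.max(), m + 1 )
    bins = np.digitize( diffs, edges[ 1:-1 ] ) + 1

    # Step 4: the smallest point gets rank 1; each later rank adds its bin value
    ranks = np.concatenate( ( [ 1 ], 1 + np.cumsum( bins ) ) )
    return x, ranks

x, ranks = difference_bin_ranks( [ -1.0, -.5, -.3, -.2, 0.0, .1, .5, 2.0 ] )
print( ranks )    # gapped raw ranks, ready for distribution fitting (Step 5)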
3.5 Ranking in Practice

Financial data is often categorized for the purpose of generating factors, or indicators. Valuation factors are often scaled by industry sector. Volatility factors are sometimes scaled by capitalization group. For cross-sectional analysis, sectors and groups are often defined in fundamental data. For time-series analysis, data may be grouped by time periods—months, days, or intra-day periods. For example, we could look at the monthly ROE for the top-decile stocks.

If the distribution of data in each group, or sector, or category, or "bucket," is different, then standardizing by group may not help much, unless you use a ranking method. Table 4 contains some sample data that should illustrate the value of ranking by groups.

     A        B    C    D    E    F    G    H    I
1    Group    Raw Data
2    A        22   25   26   28   30   35   36   39
3    B        5    6    6    7    7    7    8    9

     A        B    C    D    E    F    G    H    I
4    Group    Raw Ranks
5    A        1    2    3    4    5    6    7    8
6    B        1    2    2    5    5    5    7    8

In the first table here, the data for group A clearly indicates a different distribution. By ranking the data by group, we can compare apples to apples.

3.6 Double Standardizing

It is also possible to double standardize. That is, we standardize one way, then the other. In this case, the order of standardization—cross-sectional first, or time-series first—is important to the final meaning of the factor. For example, in the case of performing time-series standardization first, then cross-sectional standardization, consider the data for IBM:

IBM
Month    Factor Data    Z-Score
March    1.02           -.23
April    2.21            .96
May      1.00           -.25
June     3.52           2.27

After the time-series standardization, the factor data and z-score for June clearly appear to be unusually high. However, after a cross-sectional standardization, as can be seen next, most other stocks also seem to be high in June.

Stocks, June
Symbol    Z-Score
IBM       2.27
LUV       1.95
INTC      2.35
WMT       2.02

So, 2.27 is nothing special in terms of upward movement.

LAB 3.2: Ranking and Standardizing

Given the following fundamental data for two stocks:

     A        B
1    ABC      XYZ
2    12.50    0.12
3    9.82     0.25
4    11.77    0.17
5    15.43    0.05
6    19.03    0.31

Rank the data, calculate Spearman's rank correlation, and then standardize the data. Fortunately, Python's sklearn and scipy make this easy.

from sklearn.preprocessing import StandardScaler
import numpy as np
from scipy import stats

ABC = np.array( [ 12.50, 9.82, 11.77, 15.43, 19.03 ] )
XYZ = np.array( [ 0.12, 0.25, 0.17, 0.05, 0.31 ] )

# Rank the data
ABC_rank = stats.rankdata( ABC )
XYZ_rank = stats.rankdata( XYZ )
print( '\nRANKED DATA:\n' )
print( ABC_rank )
print( XYZ_rank )

# Calculate Spearman's Rank Correlation
spear_corr = stats.spearmanr( ABC, XYZ )
print( '\nSPEARMAN CORRELATION:\n' )
print( spear_corr.correlation )

# Standardize the data
scaler = StandardScaler()
ABC_std = scaler.fit_transform( ABC.reshape( -1, 1 ) )
XYZ_std = scaler.fit_transform( XYZ.reshape( -1, 1 ) )
print( '\nSTANDARDIZED DATA:\n' )
print( ABC_std )
print( XYZ_std )
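As a quick, self-contained check (a sketch using the same numbers as the lab), Spearman's rank correlation is just the Pearson correlation computed on the ranks, so the two calculations below should print the same value.

import numpy as np
from scipy import stats

ABC = np.array( [ 12.50, 9.82, 11.77, 15.43, 19.03 ] )
XYZ = np.array( [ 0.12, 0.25, 0.17, 0.05, 0.31 ] )

# Spearman's rho equals the Pearson correlation of the ranks
ABC_rank = stats.rankdata( ABC )
XYZ_rank = stats.rankdata( XYZ )
print( np.corrcoef( ABC_rank, XYZ_rank )[ 0, 1 ] )    # Pearson on the ranks
print( stats.spearmanr( ABC, XYZ ).correlation )      # same value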
3.7 Data Cleaning Checklist

Pre-cleaning activities:
1. Make a copy of the raw data set, if possible.
2. Create a data dictionary.
3. If necessary, create a data map to document the translation between the raw data set and a new, redefined data set.
4. Check the statistical properties of the uncleaned data.
5. Graph the data, including boxplots and scatterplots, potentially.
6. Create histograms of the data.
7. Check for normality and seasonality.
8. Calculate the correlation matrix and create correlation plots.

Detection:
1. Prototype error/outlier detection algorithms in Excel.
2. Code and test Python detection algorithms.

Classification:
1. Define what counts as an error, an outlier, or seasonality, and what needs to be corrected, trimmed, or winsorized.
2. Prototype correction algorithms in Excel.
3. Code and test Python classification and correction algorithms.

Correction:
1. Check the statistical properties of the cleaned data set.
2. Graph the data, including boxplots and scatterplots, potentially.
3. Create histograms of the data.
4. Check for normality and seasonality.
5. Calculate the correlation matrix and create correlation plots.

CHAPTER 4 DATABASES and SQL

4.1 Database Management Systems

Financial analysis requires data—historical price data, fundamental data, calculated data, economic data, news data, sentiment data. Python is an efficient platform for creating financial applications that interact with databases. In general, there are two types of databases used in financial markets: operational databases and analytical databases. Operational databases store dynamic data, such as portfolio and position information. Analytical databases hold static data, such as historical data.

A database management system (DBMS) is a software package that controls the organization, storage, management, and retrieval of data. It provides the ability to access (or "query"), add, delete, and modify the data. The four types of DBMSs are: hierarchical, network, relational, and object models. The best one to use depends upon the particular data and the user's needs. The user typically balances transaction rate (speed), reliability, maintainability, scalability, and cost. Of concern for us is that both the operational and analytical databases we will use follow the relational database model (RDM). The most popular large-scale RDBMSs are Microsoft SQL Server, Oracle, and Sybase.

In the RDM, data is held in tables, which are made up of columns, called fields, and rows, called records. A field, or column, represents a characteristic of a record and has a name, a data type, and (usually) a length. Data in databases can be alphanumeric, numeric, or date/time. Also, fields can contain distinct or multipart values and may have values that are calculated. A record, or row, holds the actual data in a table. A single record in a table is made up of one row containing all of the columns in the table, including a primary key that uniquely identifies the record.

Connections between different tables are defined in relationships, which ensure data integrity, consistency, and accuracy. Relationships happen through keys. A primary key is a special field in a table that uniquely identifies a record. No two records in the table can have the same value for the primary key field. Every table in a relational database must have a primary key, and no two tables should have the same primary key.

Foreign keys establish relationships between pairs of tables. Relationships between two tables arise when the primary key column in one table is identical to the foreign key column in the other. A basic difference between a primary key and a foreign key is that a foreign key allows null values and duplicate values, and it refers to a primary key in another table.

A relationship is said to be one-to-one if a single record in the first table is related to a single record in the second table, and vice versa. A relationship is said to be one-to-many if a single record in the first table can be related to several records in the second table, but a single record in the second table can be related to only a single record in the first table. A relationship is said to be many-to-many if a single record in the first table is related to many records in the second table, and vice versa. In the case of a many-to-many relationship, we need to create a linking table by copying the primary key from each table into the new table.
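Here is a minimal sketch of a linking table in SQLite. The table and column names (Portfolios, Stocks, PortfolioHoldings) are hypothetical and are not part of the chapter's databases; they simply show how copying the two primary keys into a third table resolves a many-to-many relationship.

import sqlite3 as lite

con = lite.connect( ':memory:' )    # throwaway in-memory database for illustration
with con:
    cur = con.cursor()
    cur.execute( "CREATE TABLE Portfolios( PortfolioID TEXT PRIMARY KEY, Manager TEXT )" )
    cur.execute( "CREATE TABLE Stocks( Symbol TEXT PRIMARY KEY, Name TEXT )" )
    # the linking table holds one row per (portfolio, stock) pair
    cur.execute( """CREATE TABLE PortfolioHoldings(
                        PortfolioID TEXT REFERENCES Portfolios( PortfolioID ),
                        Symbol TEXT REFERENCES Stocks( Symbol ),
                        PRIMARY KEY( PortfolioID, Symbol ) )""" )
con.close()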
A schema defines the formal, or even theoretical, structure of a database. It is how the database is constructed, or conceptualized, using tables, fields, relationships, etc. Integrity constraints ensure compatibility between different parts of the schema, or structure. A schema can contain formulas that represent the integrity constraints for a specific application. We often represent a database's schema graphically.

Normalization is a process of analyzing database schemas to minimize redundancies and the problems associated with inserting, deleting, and updating data. If a schema doesn't accomplish these things, we often decompose it into smaller schemas. Thus, normalization is often the process of breaking down a large table or tables into smaller tables. A normal form is a set of rules that test a table structure to ensure it is sound and free of errors. There are at least five normal forms used to test for specific sets of problems. The tables we will use are in at least third normal form, since each one has a primary key that uniquely identifies each record.

LAB 4.1: Working with SQLite Databases

SQLite is an open source RDBMS. You do not need to have SQLite software installed on your computer to interface with SQLite databases through Python. According to the SQLite website (https://www.sqlite.org/about.html), "SQLite is an in-process package that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. [They claim that] SQLite is the most widely deployed database in the world. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file," which has an .sqlite extension.

In Python, we can create and access SQLite databases. Let's create a test.sqlite database and use SQL statements to create a table, insert rows into the table, and query the database.

import sqlite3 as lite
import os

# if the database already exists, don't try to create it
if not os.path.isfile( 'C:\\Python\\test.sqlite' ):
    con = lite.connect( 'C:\\Python\\test.sqlite' )
    with con:
        cur = con.cursor()
        cur.execute( "CREATE TABLE Prices(Symbol TEXT, Date DATE, Price FLOAT, Volume INT32)" )
        cur.execute( "INSERT INTO Prices VALUES('IBM', '01/02/2017', 101.1, 500 )" )
        cur.execute( "INSERT INTO Prices VALUES('WMT', '01/02/2017', 84.56, 400 )" )
        cur.execute( "INSERT INTO Prices VALUES('MSFT', '01/02/2017', 45.68, 1500 )" )
    con.close()

# fetchall gets the result set of an SQL query and returns a list of tuples
con = lite.connect( 'C:\\Python\\test.sqlite' )
with con:
    cur = con.cursor()
    cur.execute( 'SELECT * FROM Prices' )
    rows = cur.fetchall()
    for row in rows:
        print( row )
con.close()
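A safer way to add rows—a small follow-on sketch assuming the same test.sqlite Prices table, with made-up values—is to use ? parameter placeholders and executemany() rather than pasting the values into the SQL string.

import sqlite3 as lite

new_rows = [ ( 'AXP', '01/03/2017', 77.12, 800 ),
             ( 'KO',  '01/03/2017', 41.80, 650 ) ]

con = lite.connect( 'C:\\Python\\test.sqlite' )
with con:
    cur = con.cursor()
    # the ? placeholders let sqlite3 handle quoting and types for us
    cur.executemany( "INSERT INTO Prices VALUES( ?, ?, ?, ? )", new_rows )
con.close()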
LAB 4.2: Working with the Finance.sqlite Database

Finance.sqlite is an analytical database that uses flat files to hold daily historical price data for 13 stocks and the S&P 500. The individual data tables in Finance.sqlite are named AXP, GE, GM, IBM, INTC, JNJ, KO, MCD, MO, MRK, MSFT, SUNW, WMT, and SPX. In addition, there is a validation table named Tickers, which contains the 13 stock ticker symbols shown. The 14 data tables consist of the primary key column, the Date, and open, high, low, close, and volume fields.

Date       OpenPrice    HighPrice    LowPrice    ClosePrice    Volume
2-Jan-90   23.54        24.38        23.48       24.35         1760600
3-Jan-90   24.53        24.72        24.44       24.56         2369400
4-Jan-90   24.60        24.94        24.56       24.84         2423600
5-Jan-90   24.81        25.25        24.72       24.78         1893900
8-Jan-90   24.66        25.06        24.66       24.94         1159800

The Tickers validation table consists of a single column named Symbols, which holds the ticker symbols for each of the 13 stocks. Here is a sample of the Tickers table:

Symbols
AXP
GE
GM
IBM
etc.

Let's connect to Finance.sqlite.

import sqlite3 as lite

con = lite.connect( 'C:\\Python\\Finance.sqlite' )
with con:
    cur = con.cursor()
    cur.execute( 'SELECT * FROM IBM' )
    rows = cur.fetchall()
    for row in rows:
        print( row )
con.close()

LAB 4.3: Working with the Options.sqlite Database

The Options.sqlite operational database uses a relational database structure to hold information about stocks and options, as well as stock trades and option trades. In fact, there are four tables in the Options.sqlite database representing each of these things—Stocks, OptionContracts, StockTrades, and OptionTrades. As we saw earlier, the relationships between two tables in a relational database are made possible by common primary and foreign keys. In Options.sqlite, for example, the Stocks and StockTrades tables are related through a StockSymbol primary key in the Stocks table and the foreign key StockSymbol column in the StockTrades table. Here is a diagram showing the structure, or schema, of the Options.sqlite database. In this diagram, the relationships are represented by arrows.

All of the relationships in the Options.sqlite database are one-to-many. As you may be able to gather from the diagram, a one-to-many relationship exists between the Stocks and OptionContracts tables. Clearly, a single stock can have many option contracts on it. But in the opposite direction, it is not the same: a single option contract can have only one underlying stock associated with it.

import sqlite3 as lite

con = lite.connect( 'C:\\Python\\Options.sqlite' )
with con:
    cur = con.cursor()
    cur.execute( 'SELECT * FROM OptionTrades' )
    rows = cur.fetchall()
    for row in rows:
        print( row )
con.close()

4.2 Structured Query Language

Structured Query Language (SQL) is an ANSI/ISO-standard language for communication and interaction with databases. It was created to be a cross-platform syntax to extract and manipulate data from disparate database systems. So, in theory, the same SQL queries written for an Oracle database will work on a Sybase database or an Access database, and so on. However, database vendors have also developed their own versions of SQL, such as Transact-SQL and Oracle's PL/SQL, which extend ANSI/ISO SQL. This chapter will focus on writing standard SQL and will not use any vendor-specific SQL syntax.

We can embed SQL statements into our Python programs to perform everything from simple data retrieval to high-level operations on databases. The SQL statements that we will most often be concerned with when developing quantitative trading or risk management systems are those that retrieve data, called data query language (DQL). However, we will at times also need to write, change, or delete data in a database. These types of SQL statements are referred to as data manipulation language (DML). SQL can also be used to change the structure of the database itself; these SQL statements are called data definition language (DDL).
Data control language (DCL) includes SQL commands like GRANT and REVOKE, which are used to provide and remove access privileges for database users.

4.2.1 Data Manipulation Language

We use DML to retrieve and otherwise work with the actual data held within a database.

SELECT

Reading data is the most common task we want to perform against a database. A SELECT statement queries the database and retrieves the data that matches the criteria we specify. The SELECT statement has five main clauses, although a FROM clause is the only required one. Each of the clauses has a wide array of options and parameters. A SELECT statement means that we want to choose columns from a table. When selecting multiple columns, a comma must delimit each of them except for the last column. Also, be aware that, unlike Python, SQL is not case sensitive: upper or lower case letters will do just fine. Be aware too that most, but not all, databases require the SQL statement to be terminated by a semi-colon.

Before we get too in depth, let's create a Python program to test out the SQL statements we look at as we go along. Create a new Python file named SQL_example.

import sqlite3 as lite

con = lite.connect( 'C:\\Python\\Options.sqlite' )
with con:
    SQLstatement = "SELECT * FROM OptionTrades"
    cur = con.cursor()
    cur.execute( SQLstatement )
    rows = cur.fetchall()    # rows is a list of tuples
    for row in rows:
        print( row )
con.close()

Now, we will test out several SQL statements. Run the program and embed the SQL statement shown.

SELECT OptionSymbol, StockSymbol, Year, Month, Strike, Bid, Ask, OpenInt FROM OptionContracts

Note that the columns are displayed in the order that they appear in the SELECT statement. If all columns from a table are needed in the result set, we do not need to specify them explicitly. Rather, in the case where all columns are to be selected, we can use the * symbol.

WHERE Clause

The previous example retrieved a result set that included all the rows in the table for the specified columns. However, we may want to filter some rows out according to some condition or comparison. This is where the WHERE clause comes in. For comparison in SQL, we use the following operators:

Comparison Operator    Description
<                      Contents of the field are less than the value.
>                      Contents of the field are greater than the value.
>=                     Contents of the field are greater than or equal to the value.
=                      Contents of the field are equal to the value.
<>                     Contents of the field are not equal to the value.
BETWEEN                Contents of the field fall between a range of values.
LIKE                   Contents of the field match a certain pattern.
IN                     Contents of the field match one of a number of criteria.

So, if we were interested in only the option contracts with open interest greater than 10,000, our SQL would look like this:

SELECT * FROM OptionContracts WHERE OpenInt > 10000

The WHERE clause can also have multiple conditions using AND or OR. If we wanted to see all contracts where open interest is over 10,000 and the bid is greater than 0, it would look like this:

SELECT * FROM OptionContracts WHERE OpenInt > 10000 AND Bid > 0

To build a WHERE clause on a text field, SQLite requires that we use single quotes for string comparison, like this:

SELECT * FROM OptionContracts WHERE StockSymbol = 'IBM'

Date comparison sometimes requires the use of the pound sign (#) or quotes, depending on the database, but SQLite uses single quotes. For example, if we wanted to see all of the options trade