MSF 503 Chapter 3 Financial Data PDF
Document Details
Uploaded by FeatureRichParabola
Illinois Tech
Ben Van Vliet
Tags
Summary
This chapter from the MSF 503 course discusses various types of financial data, including price data, valuation data, fundamental data, calculated data, economic data, unstructured data, and sentiment data. It also details the importance of data cleaning and the different issues that may be encountered, such as bad values, missing data, and formatting problems.
Full Transcript
CHAPTER 3 FINANCIAL DATA “The data is always the problem.” 3.1 Types of Databases in Finance We won’t look at all of them in this text, but there are at least six types of financial data we use in finance: Price Data consists o...
CHAPTER 3 FINANCIAL DATA “The data is always the problem.” 3.1 Types of Databases in Finance We won’t look at all of them in this text, but there are at least six types of financial data we use in finance: Price Data consists of the bid and ask prices and quantities, and trade prices and quantities for securities and derivatives. Valuation Data is different from price data. For some financial instruments— bonds, swaps, and all OTC derivatives—no price data exists, or if it does, it is a highly guarded secret. For these, valuation data is all there is. That is, the price exists only in theory, and, furthermore, is not a firm bid or offer that any market maker is obliged to honor. Fundamental Data consists of everything that is disclosed in 10-Q quarterly and 10-K annual reports, including key business items, such as earnings, sales, inventories, and rents. Calculated Data is data that is calculated from fundamental data, such as ROE, price to book, beta, forecasted dividends, free cash flow, etc. Index values, such as the Dow Jones and S&P 500 are calculated and broadcast in real-time. Economic Data, such as CPI and GDP, are key indicators often used in financial analysis and trading. Unstructured Data consists of things like news articles, pictures, Twitter feeds, and product reviews. Sentiment Data quantifies the emotional content embedded in unstructured data. 3.2 Data Dictionary © 2024 Ben Van Vliet 54 As we have seen, we will be accessing data in different file formats. Regardless of the format, however, the data almost always comes in tables consisting of rows (or records) and columns (or fields). It is very important to create a data dictionary for every data set. A data dictionary contains the field names in a data set, along with their data types and descriptions. Here is an example of a data dictionary for the above data. Data Field Name Description Valid Values Type symbol string The symbol of the financial instrument. OIL, CHEM, IT, industry string The industry abbreviation. TRANS quantity int The number of shares owned. entry_price float The price the stock was bought at. The profit or loss since the stock was profit_dollar float bought. winlose char Whether the trade is a winner or a loser. W, L 3.3 Data Cleaning Good inputs are the key to success in financial modeling, and forecasting, as well as risk management, requires good, clean data for successful testing and simulation. Virtually no data, however, is perfect and financial engineers spend large amounts of time cleaning errors and resolving issues in data sets. It is very easy and very common to underestimate the amount of time preprocessing will take. © 2024 Ben Van Vliet 55 First, identify and categorize all the types of problems you expect to encounter in your data; then survey the available techniques to address those different types of errors; and finally develop methods to identify and resolve the problems. Data problems fall into one of four categories: Bad, or incorrect, data. Formatting problems. Outliers, which skew results. Point-in-time data problems. Types of Bad Data Example Bad values Tick of 23.54, should be 83.54 Missing data Blank field or data coded as “9999,” “NA,” or “0” Bad dates 2/14/12997 Column shift-data Value printed in an adjacent column File corruption Network errors Data from different vendors may come in different Different data formats formats or table schemas Missing fundamental data The company may have changed the release cycle Some things to look for algorithmically when attempting to clean a data set. Scanning for Bad Data Intra-period high less than closing price Intra-period low greater than opening price Volume less than zero Bars with wide high/low ranges relative to some previous time period Closing deviance. Divide the absolute value of the difference between each closing price and the previous closing price by the average of the preceding 20 absolute values Data falling on weekends or holidays Data with out-of-order dates or duplicate bars Price or volume greater that four standard deviations from rolling mean 3.3.1 Identify Required Cleaning Activities and Algorithms © 2024 Ben Van Vliet 56 All data, both real-time and historical, should be assumed to contain errors and issues. For example, data issues for high frequency systems will center more on clean tick data, whereas those for systems with longer-term holding periods will focus more on, say, dividends and releases of and revisions to financial statements. 3.3.2 Bad Data and Outliers In all cases cleaning of bad data is a process that consists first of detection, then classification of the root cause of the error, and then correction of the error. Whatever method you use to accomplish these must be shown to operate on both historical and real- time data. This matters because data cleaning algorithms can add latency to real-time systems. Algorithms that cannot be performed in real time prior to trade selection should not be used on historical data. Since many models are sensitive to even a few bad data points, I recommend looking carefully at means, medians, standard deviations, histograms, and minimum and maximum values of time series data. A good way to do this is to sort or graph the data to highlight values outside an expected range, which may be good (but outlying) or bad data. Outliers may good or real data points, but may obliterate the underlying structure, or information, otherwise present in the data. Brownlees and Gallo (2005) propose the following: Construct a z-score heuristic and then eliminating (or “trimming”) observations if the absolute value of the score is beyond some threshold. The z-score is computed using a moving average, moving variance, and a parameter g, which induces a positive lower bound on the heuristic variance, which is useful for handling sequences of identical prices as: But, why not winsorize the outliers, rather than trim them? For example, set all outliers to a specified percentile of the data. Say, any data point above the 95%ile would be “pulled in,” or set to the value that is the 95%ile. © 2024 Ben Van Vliet 57 LAB 3.1: Winsorizing in Practice Consider the following data set: 100, 98, 105, 100, 138, 98, 101, 97, 104, 95, 54, 101, 104, 99, 100, 109, 105, 100, 95, 99 The boss is concerned that outliers in the data may ruin a forecasting model. Write a Python script that will winsorize the outliers to the 5 / 95%ile. As discussed, do it in Excel first. Graphing the data shows us that there are outliers. Now, write the Python code so that you get the same result as you did in Excel. import numpy as np from scipy.stats import norm data = np.array( [ 100, 98, 105, 100, 138, 98, 101, 97, 104, 95, 54, 101, 104, 99, 100, 109, 105, 100, 95, 99 ] ) © 2024 Ben Van Vliet 58 avg = data.mean() stdev = data.std() pct_5 = norm.ppf(.05, loc = avg, scale = stdev ) pct_95 = norm.ppf(.95, loc = avg, scale = stdev ) pct_5 = round( pct_5 ) pct_95 = round( pct_95 ) for x in np.nditer( data, op_flags = ['readwrite'] ): if x < pct_5: x[...] = pct_5 if x > pct_95: x[...] = pct_95 print( data ) To be perfectly honest, I had to use Google four times in order to get this to work right. © 2024 Ben Van Vliet 59 Cleaning historical data is a trade-off between under and over-cleaning. What is bad data to you may not be bad data to me. What is clean at 10-minute intervals may be dirty at 10-second intervals. There is no single, correct way to clean data that applies to all scenarios. Models built on over-cleaned historical data may have problems with real- time data that contains errors. The objective is to manage the trade-off to produce a historical data series where problems are dealt with, but where the real-time properties of the data are maintained. 3.3.3 The Point-in-Time Data Problem Even “cleaned” data can have problems. Consider the following scenario: stock price data for a trading day is cleaned after the close of business. The next day, your data vendor sends out an adjustment file reflecting the corrected, or cleaned, price data. This occurs regularly due to block trades and other off-exchange trades being reported late, all of which become part of the day’s historical price data. However, you didn’t have the cleaned data while you were trading that day. If you use the cleaned data for a model, that’s not the data you would have had at the time. Further, with the new, cleaned price data, many calculations for derivatives prices from the previous day, which were based on the uncleaned data, will now be wrong. Implied volatility calculations will be wrong. All the historical prices of OTC derivatives that are based on the stock’s implied volatility will now be wrong. End-of-day rebalancing algorithms could also be wrong due to the uncleaned data used to run the models. So, cleaned data may not be better. An algorithm for calculating end-of-day prices is very important, but not as easy as it sounds. At the close, block orders affect prices, traders sometimes push bids and asks around, and small orders can move the market in the absence of late liquidity. Market fragmentation (there are at least 60 execution venues for stocks in the U.S.) make it difficult to define a single closing price for a stock. To calculate a clean closing price for a stock, it may be advisable to get a non-official closing price five minutes prior to the close across selected exchanges. Throw out all locked or crossed bids and asks, and then determine the weighted mid-point price. Futures prices are generally fairly clean since contracts trade only on one exchange. Option prices are very difficult to deal with, © 2024 Ben Van Vliet 60 however, due to market fragmentation, liquidity issues, and issues with the prices of the underlyings. 3.3.4 Synchronizing Data Databases of the different types listed above have different updating periods. Take splits, for example. Price data vendors update daily. Balance sheet data vendors may update their databases weekly. As a result, a given ratio, such as sales-to-price, may contain an unsplit sales figure and a split price. Fixing this problem is called synchronizing the data, accomplished by either buying synchronized data from a vendor or performing the task in-house. The real key to synchronizing data, or blending data, is a data map, sometimes called a Rosetta Stone. A Rosetta Stone is the set of unique identifiers used to link data and instruments across databases from various vendors. A proper Rosetta Stone allows the trading or investment firm to trade many instruments—stock, options, bonds, CDs, and OTC products—on a single underlying instrument. Further, developing unique identifiers across underlyings and across vendors enables the blending of proprietary data with purchased data. 3.4 Rescaling Data 3.4.1 Normalizing The process of normalization rescales a dataset over the range [ 0, 1 ]. xi xmin xi ,norm xmax xmin While this may be useful in some cases, outliers will be lost. We use normalization when we want to put things in probability space, since probabilities range from 0 to 1. We can also normalize over some other range. Given raw data in column A, we can convert it to the new normalized score in column C. Start by sorting the raw data. © 2024 Ben Van Vliet 61 Given data x1 through xn where i = 1,..., n, the cumulative probabilities in column B are found as: xi xmin zi min (max min) xmax xmin In this next table, given the ranks of the raw data in column A, we can convert it to the new standardized score in column C. The Excel formulae for generating the data are the same as in Table 1. The difference between simple ranking and distribution fitting is that using ranks is like fitting to a uniform distribution. Figure 1: Simple Ranking Fits to a Uniform Distribution © 2024 Ben Van Vliet 62 As can been seen from Figure 1, two ranks in the neighborhood of P( a ) will map the appropriate distance apart, as will two points in the neighborhood of P( b ), because of the constant slope of F( x ) in a uniform distribution. 3.4.2 Standardizing Data More often, what we want to do is standardization, which rescales the data set to express the values as z-scores, their number of standard deviations from the mean. Thus, we have to remove the mean from the training data (i.e. center it) and scale it to unit variance. xi zi Suppose that, given raw fundamental data (e.g. earnings per share, price-to-book ratio, etc.), we wish to standardize the data by fitting it to the normal distribution in this way, but say between plus and minus 2 standard deviations. (The probabilities associated with this range are 2.275% and 97.725% (using Excel’s NORMSDIST() function)). Figure 2: Fitting Ranked Data to a Normal Distribution © 2024 Ben Van Vliet 63 Fitting the ranks to a normal distribution is different. As can be seen in Figure 2, two points in the neighborhood of P( a )—such as data points with ranks 1 and 2—will map further away than will two points in the neighborhood of P( b )—such as data with ranks 455 and 456—because the slope of F( x ) not constant, and is steeper at b than a. So, distribution fitting takes differences in the ranks of observations (or in some cases the observations themselves), and imposes a distributional prior as to how much importance gaps in neighboring observations should have. The distributional method determines the importance of outliers. import sklearn.preprocessing as pp import numpy as np data = np.array( [ -.50, -.25, -.22, -.18, 0.0,.10,.20 ] ) data = data.reshape( 7, 1 ) # STANDARDIZE the data by removing the mean and scaling to unit variance scaler = pp.StandardScaler() std_data = scaler.fit_transform( data ) print( '\nSTANDARDIZED DATA:\n' ) print( std_data ) # NORMALIZE the data by removing the mean and scaling each data point to a given range mm_scaler = pp.MinMaxScaler( feature_range = ( -2, 2 ) ) mm_data = mm_scaler.fit_transform( data ) print( '\nSTD MINMAX DATA:\n' ) print( mm_data ) # NORMALIZE the ranked data the same two ways ranks = np.array( [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0 ] ) ranks = ranks.reshape( 7, 1 ) std_ranks = scaler.fit_transform( ranks ) print( '\nNORMALIZED RANKED DATA:\n' ) © 2024 Ben Van Vliet 64 print( std_ranks ) mm_ranks = mm_scaler.fit_transform( ranks ) print( '\nSTD MINMAX RANK DATA:\n' ) print( mm_ranks ) 3.4.3 Difference Bins A variation on the ranking theme is to scale the ranks by the difference between data points, so that points with larger differences between them have a correspondingly large gap between their ranks. This is typically done by placing the differences into bins. The steps are as follows: Step 1: Sort the data from low to high. Step 2: Find the differences between points. Step 3: Put the differences into bins 1…m according to their size. That is, small differences go into bin 1, and the largest differences go into bin m. Step 4: To assign ranks to the data, give the smallest data point a rank of 1, and then add the bin value to each successive rank, so that each value gets assigned a rank that differs from the previous rank by the bin value. Thus, there will be gaps in the numbering. Step 5: Finally, proceed with distribution fitting as before. Here is a numerical example using 3 difference bins to illustrate this technique. A B C D Difference Raw 1 Raw Data Difference Bin Rank 2 -1 1 3 -.5.5 2 3 4 -.3.2 1 4 5 -.2.1 1 5 6 0.2 1 6 7.1.1 1 7 8.5.4 2 9 9 2 1.5 3 12 © 2024 Ben Van Vliet 65 In this table, given the sorted raw data in column A, we can convert it to the new raw ranks in column D. These raw ranks can be used as inputs into the previous example to generate a normalized score. 3.5 Ranking in Practice Financial data is often categorized for the purpose of generating factors, or indicators. Valuation factors are often scaled by the industry sector. Volatility factors are sometimes scaled by capitalization group. For cross-sectional analysis, sector and groups are often defined in fundamental data. For time-series analysis, data may be grouped by time periods, months, days, or intra-day periods. For example, we could look at the monthly ROE for the top decile stocks. If the distribution of data in each group, or sector, or category, or “bucket,” is different, then standardizing by group may not help much, unless you use a ranking method. Table 4 contains some sample data that should illustrate the value of ranking by groups. A B C D E F G H I 1 Group Raw Data 2 A 22 25 26 28 30 35 36 39 3 B 5 6 6 7 7 7 8 9 A B C D E F G H I 4 Group Raw Ranks 5 A 1 2 3 4 5 6 7 8 6 B 1 2 2 5 5 5 7 8 In the first table here, the data for group A clearly indicates a different distribution. By ranking the data by group, we can compare apples to apples. 3.6 Double Standardizing It is also possible to double standardize. That is, we standardize one way, then the other. In this case the order of standardization—cross-sectional first, or time-series first—is important to the final meaning of the factor. © 2024 Ben Van Vliet 66 For example, in the case of performing time-series standardization first, then cross-sectional standardization, consider the data for IBM: IBM Month Factor Data Z-Score March 1.02 -.23 April 2.21.96 May 1.00 -.25 June 3.52 2.27 After the time-series standardization, the factor data and z-score for June clearly appear to be unusually high. However, after a cross-sectional standardization, as can be seen next, most other stocks seem to also be high in June. Stocks June Symbol Z-Scores IBM 2.27 LUV 1.95 INTC 2.35 WMT 2.02 So, 2.27 is nothing special in terms of upward movement. © 2024 Ben Van Vliet 67 LAB 3.2: Ranking and Standardizing Given the following fundamental data and two stocks: A B 1 ABC XYZ 2 12.50 0.12 3 9.82 0.25 4 11.77 0.17 5 15.43 0.05 6 19.03 0.31 Rank the data, calculate the Spearman’s rank correlation, and then standardize the data. Fortunately, Python’s sklearn and scipy make this easy. from sklearn.preprocessing import StandardScaler import numpy as np from scipy import stats ABC = np.array( [ 12.50, 9.82, 11.77, 15.43, 19.03 ] ) XYZ = np.array( [ 0.19, 0.25, 0.17, 0.05, 0.23 ] ) # Rank the data ABC_rank = stats.rankdata( ABC ) XYZ_rank = stats.rankdata( XYZ ) print( '\nRANKED DATA:\n' ) print( ABC_rank ) print( XYZ_rank ) # Calculate Spearman's Rank Correlation spear_corr = stats.spearmanr( ABC, XYZ ) print( '\nSPEARMAN CORRELATION:\n' ) print( spear_corr.correlation ) # Standardize the data © 2024 Ben Van Vliet 68 scaler = StandardScaler() ABC_std = scaler.fit_transform( ABC.reshape( -1, 1 ) ) print( '\nSTANDARDIZED DATA:\n' ) print( ABC_std ) © 2024 Ben Van Vliet 69 3.7 Data Cleaning Checklist Pre-cleaning activities: 1. Make a copy of the raw data set, if possible. 2. Create a data dictionary. 3. If necessary, create a data map to document the translation between the raw data set and a new, redefined data set. 4. Check the statistical properties of the uncleaned data. 5. Graph the data, including boxplots and scatterplots, potentially. 6. Create histograms of the data. 7. Check for normality and seasonality. 8. Calculate the correlation matrix and create correlation plots Detection: 1. Prototype error/outlier detection algorithms in Excel. 2. Code and test Python detection algorithms. Classification: 1. Define what counts as an error, an outlier, seasonality, and what needs to be corrected, trimmed, or winsorized. 2. Prototype correction algorithms in Excel 3. Code and test Python classification and correction algorithms. Correction: 1. Check the statistical properties of the cleaned data set. 2. Graph the data, including boxplots and scatterplots, potentially. 3. Create histograms of the data. 4. Check for normality and seasonality. 5. Calculate the correlation matrix and create correlation plots © 2024 Ben Van Vliet 70