Chapter 7: Introduction to Time Series Analysis
Statistics (19658)-214
CS van der Westhuizen, Universiteit Stellenbosch
1 Introduction

Time series analysis is a statistical methodology that deals with time series data, that is, data points indexed (or listed or graphed) in time order. (This note set is based on Makridakis et al. (1998).) Time series data are prevalent in fields such as economics, finance, and medicine. This type of analysis is essential for understanding underlying patterns in temporal data and making forecasts based on those patterns.

Why is forecasting important? Forecasting has become essential to decision makers. It is used by physical scientists, business managers, social scientists, government leaders, and many others. Researchers have developed many forecasting techniques over the centuries, and the outcomes of many phenomena can nowadays be forecast quite easily: the time of sunrise, the speed of a falling object, rainy weather, and much more. If we can measure something, we can understand it, and if we can understand it, we can predict/forecast what it will be.

We will mainly focus on time series forecasting. Time series data are observed daily in many areas. The following are just a few examples:

1. Business: weekly interest rates, daily closing stock prices, monthly price indices, yearly sales figures.
2. Meteorology: daily high and low temperatures, annual precipitation and drought indices, and hourly wind speeds.
3. Agriculture: annual figures for crop and livestock production, soil erosion, and export sales.
4. Biology: the electrical activity of the heart at millisecond intervals.
5. Ecology: the abundance of an animal species.

A time series is a sequence of data points measured at successive points in time, spaced at uniform time intervals. For example, the daily closing value of the stock market, monthly rainfall data, and annual GDP growth rates are all examples of time series data.
The purpose of time series analysis is then generally two-fold. One, to understand or model the stochastic mechanism that gives rise to an observed series. Two, to forecast the future values of a series based on the history of that series and, possibly, other related series or factors.

2 An overview of forecasting techniques

Forecasting situations vary widely in their applications (e.g., time horizons, factors determining actual outcomes, types of data patterns). Figure 1, taken from Makridakis et al. (1998), shows four different types of time series: (a) Australian electricity data; (b) US Treasury bill contracts; (c) sales of product C; and (d) Australian clay brick data.

Figure 1: Historical data on four variables for which forecasts might be required.

In Figure 1a we see an increasing trend over time, together with seasonal behaviour. Over time the variation of the series also increases; this pattern is often referred to as multiplicative seasonality. Figure 1b shows a time series with only a decreasing trend over time. In Figure 1c the time series is relatively constant around zero, with a few spikes. Finally, in Figure 1d the time series starts with an increasing trend and then becomes almost constant over time. This time series contains seasonal behaviour as well as a cyclical pattern, and the seasonal pattern is visible within each cycle. Seasonality is a repeating behaviour over fixed time periods (e.g., quarterly or yearly); a cycle is not of fixed length and takes place over longer periods.

From these data sets we can see the diversity of areas in which time series data are observed. We are interested in using data sets such as these to forecast values in the future. Many techniques have been developed to deal with these diverse applications. Makridakis et al. (1998) put them into two major categories: quantitative and qualitative methods.
Quantitative methods can be used when sufficient quantitative information is available. They usually include methods to predict the continuation of historical patterns, such as the growth in sales or gross national product (i.e., traditional time series analysis). Quantitative methods can also be used to understand how independent variables affect the time series (i.e., explanatory analysis); for example, how do variables such as prices and advertising affect sales? On the other hand, qualitative methods are used when sufficient quantitative information is not available, but there is sufficient qualitative information (we will not consider qualitative forecasting in Statistics 214).

With time series models we forecast the future based on past data of a variable, but not necessarily on explanatory variables which may affect the forecast. The purpose of time series models is to discover the pattern in the historical data and extrapolate that pattern into the future. Examples of time series models include exponential smoothing methods and Box-Jenkins models (to be covered in Statistics 348). In explanatory analysis, regression models assume that the response variable to be forecast has an explanatory relationship with one or more independent variables (predictors). Examples of such models include least squares regression and logistic regression (this topic will be covered in Statistics 244).

Finally, it should be noted that quantitative forecasting can be applied when three conditions exist:

1. Information about the past is available.
2. This information can be quantified in the form of numerical data.
3. It can be assumed that some aspects of the past pattern will continue into the future.

The last condition is called the assumption of continuity; it is an underlying premise of all quantitative methods.

3 The basic steps in a forecasting task

In Makridakis et al.
(1998), the following five basic steps are outlined for any forecasting task in which quantitative data are available.

Step 1: Problem definition. What do we want to forecast? This is sometimes a difficult task, and it requires the forecaster to do a lot of research about who wants the forecasts, what needs to be forecast, how to obtain the data, and so on. It is important to have a clear goal about what you want to achieve.

Step 2: Gathering information. Once you have defined your problem, it is necessary to obtain past data. These data are used to fit the forecasting model. Other information about the process may play an important role as well, for example special events like an economic crisis, public holidays, or technological advances. Such events often appear as spikes in the time series, and we need to take them into account when fitting a forecasting model.

Step 3: Preliminary analysis. What do the data tell us? Are there interesting patterns (i.e., trends, cycles, seasonality, outliers)? Once we have the data, we would like to make some graphical and numerical summaries to understand the data a bit better. Such methods are discussed in the next section. Identifying any patterns is useful in choosing the correct forecasting method.

Step 4: Choosing and fitting models. This step involves choosing and fitting several quantitative forecasting models. This topic is covered in depth in Statistics 348, but we will consider a few basic approaches later in this chapter.

Step 5: Using and evaluating a forecasting model. Once a model has been chosen and its parameters estimated appropriately, the model is used to make forecasts, and the users of the forecasts will evaluate the pros and cons of the model as time progresses. A forecasting assignment is not complete when the model has been fitted to the known data; the performance of the model can only be properly evaluated after the data for the forecast period have become available.
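The steps above can be sketched end-to-end in code. The chapter itself works in Excel and R; the short Python sketch below is only an illustration of the workflow, and the monthly values in it are made up for the example (they are not one of the chapter's data sets).

```python
# Steps 2-5 on made-up monthly data (illustrative values only).
series = [112.0, 115.5, 118.2, 117.9, 121.4, 124.8, 126.1, 129.3]

# Step 3: preliminary analysis -- simple numerical summaries.
n = len(series)
mean = sum(series) / n
print(f"n = {n}, mean = {mean:.2f}")

# Step 4: choose and fit a model on a training set. Here we use the
# simplest possible model: forecast every future value with the
# training mean (more refined methods follow later in the chapter).
train, test = series[:6], series[6:]
fit = sum(train) / len(train)

# Step 5: evaluate the forecasts on the held-out test period.
mae = sum(abs(y - fit) for y in test) / len(test)
print(f"forecast = {fit:.2f}, MAE on test set = {mae:.2f}")
```

Because this series trends upward, the constant-mean forecast underpredicts the test period; that mismatch is exactly what the preliminary analysis of Step 3 is meant to reveal before a model is chosen.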
4 Basic forecasting tools

In this chapter we will make use of four example data sets to illustrate the time series analysis methodology. The data sets represent two types of data, namely cross-sectional data and time series data.

A preview of the first data set (cross-sectional data) is shown in Table 1. This data set, consisting of 30 observations and 8 variables, represents a customer survey used in the marketing division of Company X. With this data set we would like to understand how various customer characteristics (age, education, income level) relate to purchasing (monthly spend, number of purchases, satisfaction level). It is important to note that this data set is cross-sectional because all observations were observed at a single point in time.

ID   Age    Edu_Level    Inc_Level  Monthly_Spend  NumPurchases  Sat_Level
001  24.00  Bachelor's   Medium     150.00         12.00         4.00
002  35.00  Master's     High       250.00          8.00         3.00
003  29.00  High School  Low         75.00          5.00         5.00
004  45.00  PhD          High       300.00         15.00         2.00
005  32.00  Bachelor's   Medium     180.00          9.00         4.00
006  28.00  Bachelor's   Low         90.00          4.00         5.00
:    :      :            :          :              :             :

Table 1: Preview of the customer survey data.

A preview of the second data set (time series data) is shown in Table 2. This data set represents monthly global oil prices (in USD) for 2022 and 2023. This is typically what a time series looks like: observations over time. With time series data it is important to take note of the time unit (e.g., daily, hourly, monthly); the oil price data form a monthly time series.

A preview of the third data set (time series data) is shown in Table 3. This data set represents market stock prices (in ZAR) for a certain company in South Africa. This is a daily time series.

A preview of the fourth data set (time series data) is shown in Table 4. This data set represents Australian beer production (in megalitres) from 1991 to 1995.
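The defining feature that separates the time series data sets from the cross-sectional one is the time index. As a sketch, the Table 2 preview can be stored as time-ordered (date, value) pairs; the chapter itself uses Excel and R, so this plain Python fragment is only illustrative.

```python
from datetime import date

# The monthly oil price preview from Table 2, stored as time-ordered
# (date, value) pairs. The cross-sectional data in Table 1 have no
# such index: their rows can be shuffled without losing information.
oil_prices = [
    (date(2022, 1, 1), 65.50),
    (date(2022, 2, 1), 68.30),
    (date(2022, 3, 1), 70.00),
    (date(2022, 4, 1), 72.25),
    (date(2022, 5, 1), 69.75),
    (date(2022, 6, 1), 73.50),
]

# The time unit (here: monthly) is read off the spacing of the index.
dates = [d for d, _ in oil_prices]
assert all(b > a for a, b in zip(dates, dates[1:]))  # strictly time-ordered
print("observations:", len(oil_prices), "| unit: monthly")
```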
Given the data sets in Tables 1, 2, 3 and 4, the next step is to use graphical displays and numerical summaries to understand and describe the data better. In Statistics 214, we will make use of Excel and R to perform the analyses.

Date        Average_Price
2022/01/01  65.50
2022/02/01  68.30
2022/03/01  70.00
2022/04/01  72.25
2022/05/01  69.75
2022/06/01  73.50
:           :

Table 2: Preview of the global oil price data set. Prices are shown in USD.

Date        Price
2023/01/01  65.60
2023/01/02  66.39
2023/01/03  66.20
2023/01/04  66.48
2023/01/05  66.90
2023/01/06  67.19
:           :

Table 3: Preview of the stock price data set. Prices are shown in ZAR.

Year  Month  Production
1991  Jan    164
1991  Feb    148
1991  Mar    152
1991  Apr    144
1991  May    155
1991  Jun    125
:     :      :

Table 4: Preview of the Australian beer data set. Production is shown in megalitres.

4.1 Graphical summaries

4.1.1 Time plots and time series patterns

In this section we will look at a few time series plots that are helpful in understanding the data. We will also discuss the important time series patterns which are often present in a time series. To plot the oil price data, do the following in Excel.

1. Sort the data: Ensure that your data is sorted by date from oldest to newest to make the plot accurate and easy to understand.
2. Select your data: Click and drag to select the data you have entered.
3. Insert a chart: Navigate to the Insert tab on the Ribbon. In the Charts group, click on the Insert Line or Area Chart button. Choose Line with Markers. This will give you a clear view of each data point connected by a line, which is typical for time series data.
4. Chart title: Click on the default chart title to edit it. Enter a meaningful title such as "Monthly Global Oil Prices".
5. Axis titles: Click on the chart to select it. Go to the Chart Design tab on the Ribbon and click Add Chart Element. Choose Axis Titles > Primary Horizontal and enter "Date" for the horizontal axis.
Repeat the process for Primary Vertical and enter "Average Price (USD)" for the vertical axis.
6. Format date axis: Right-click on the horizontal axis (the dates) and select Format Axis. In the Axis Options, choose the appropriate date format under the Type dropdown to ensure dates are displayed clearly, such as "mmm-yy" for "Jan-22". Adjust the Axis Bounds and Units if necessary to achieve a suitable scale and increment for your timeline. Under the Number dropdown you can also edit the format of the dates. You can also adjust the minimum and maximum of each axis to improve the visualisation of the time series plot.
7. Gridlines: Adding gridlines can make your chart easier to read. Go to Add Chart Element > Gridlines > Primary Major Horizontal. Data labels (optional): if you want to display data values on the chart, click on a data series to select it, then go to Add Chart Element > Data Labels > Above.
8. Chart style: Experiment with different chart styles and colours by selecting your chart and choosing from the options in the Chart Styles group on the Chart Design tab.

An example of the oil price plot is given in Figure 2.

Figure 2: Global monthly oil prices (in USD).

In R you can use the following code to produce a time series plot.

Time series plot in R for the oil data

oil_prices_ts = ts(oil_prices$Average_Price, start = c(2022, 1), frequency = 12)
plot(oil_prices_ts, ylab = "Oil prices (in USD)", xlab = "Date")

The time series plot for the oil price data produced in R is shown in Figure 3. You can also create a time series plot with the ggplot2 package in R.

Figure 3: Time series plot of global monthly oil prices (in USD). This plot was produced in R.
Time series plot in R with the ggplot2 package for the stock price data

library(ggplot2)
ggplot(stock_prices_daily, aes(x = Date, y = Price)) +
  geom_line() +
  labs(title = "Daily stock prices", x = "Date", y = "Price (Rands)") +
  theme_minimal()

The time series in Figure 2 shows a clear upward trajectory with a dip in November 2022. The data do not show much noise, and the trend is clearly visible. On the other hand, the data in Figure 4 have more noise, but a downward trend is still visible.

Figure 4: Time series plot of daily stock prices (in ZAR). This plot was produced in R with the ggplot2 package.

In Figure 5 the time series does not seem to have any trend (upward or downward) and varies around a constant. However, we see seasonal behaviour over time: at every 12th month there is a large spike in the number of megalitres produced. This is typical of seasonal behaviour. In this case it happens at the end of each year, so yearly seasonality is present in the production of the beer.

Figure 5: Time series plot of the Australian beer production data, shown in megalitres. This plot was produced in R.

Example 1
Construct a time series plot for the stock price data in Excel. Ensure to add relevant titles for the axes as well as a main title (the caption).

4.1.2 Typical time series patterns

In Makridakis et al. (1998) the following summary is given of typical time series patterns.

1. A horizontal (H) pattern exists when the time series fluctuates horizontally around a constant mean. Such a series is called stationary in its mean. A product whose sales do not increase or decrease over time would be of this type.

2. A seasonal (S) pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or the day of the week). Sales of products such as soft drinks and ice cream, and household electricity consumption, all exhibit this type of pattern.
The beer data show seasonality, with a peak in production in November and December (in preparation for Christmas) each year.

3. A cyclical (C) pattern exists when the data exhibit rises and falls that are not of a fixed period. For economic series, these are usually due to economic fluctuations such as those associated with the business cycle. The sales of products such as automobiles, steel, and major appliances exhibit this type of pattern. The clay brick production shown in Figure 1d shows cycles of several years in addition to the quarterly seasonal pattern. The major distinction between a seasonal and a cyclical pattern is that the former is of a constant length and recurs on a regular periodic basis, while the latter varies in length. Moreover, the average length of a cycle is usually longer than that of seasonality, and the magnitude of a cycle is usually more variable than that of seasonality.

4. A trend (T) pattern exists when there is a long-term increase or decrease in the data. The sales of many companies, the gross national product (GNP), and many other business or economic indicators follow a trend pattern in their movement over time. The electricity production data shown in Figure 1a exhibit a strong trend in addition to the monthly seasonality. On the other hand, the beer data in Figure 5 show no trend.

5. An irregular (random) component (E) usually exists when there is no other identifiable pattern.

4.1.3 Seasonal plots

From Figure 5 we noticed strong seasonality in the Australian beer production data. It can therefore be useful to understand this seasonality better by plotting each season (i.e., year) as a different line. In Excel this can be done by rearranging the beer production observations into different columns by year, and plotting the lines simultaneously with steps similar to those shown before. An example of a seasonal plot of the beer data is shown in Figure 6.
In this graph the seasonality of each year is clearer. The Excel file Ch7 - Examples.xlsx presents this example for the beer data.

Figure 6: Seasonal plot of the Australian beer production data. This plot was produced in Excel.

4.1.4 Visualisations for cross-sectional data

The customer survey data in Table 1 are cross-sectional, which is typical of real-world data sets. Note that there is no time component in cross-sectional data. Common visualisations for such data are scatter plots and histograms (for numerical variables), and barplots (for categorical variables).

Exercise 2
Construct a suitable visualisation for each of the variables of the customer survey data. You may do this in Excel or in R.

4.1.5 Numerical summaries

In Chapter 1 we explored a variety of descriptive statistics that can be used to describe the location and the spread of a variable. These statistics are also suitable for time series data. For convenience, some of them are shown below:

mean: ȳ = (1/n) ∑_{t=1}^{n} y_t

mean absolute deviation: MAD = (1/n) ∑_{t=1}^{n} |y_t − ȳ|

mean squared deviation: MSD = (1/n) ∑_{t=1}^{n} (y_t − ȳ)²

variance: s² = (1/(n−1)) ∑_{t=1}^{n} (y_t − ȳ)²

Note that we now differentiate between the variance (where we divide the sum of squared deviations by n − 1) and the MSD (where we divide by n). Furthermore, in time series it is customary to use t as the summation index.

Additional statistics are, however, also required to describe time series data. The autocovariance and autocorrelation are two important statistics that are often used to summarise time series data. They play a similar role to the covariance and correlation coefficient for cross-sectional data. The sample autocovariance at lag k is given by

c_k = (1/n) ∑_{t=k+1}^{n} (y_t − ȳ)(y_{t−k} − ȳ),    (1)

where y_t is the observation of the time series at time t and ȳ is the mean. y_{t−1} is the observation at time t − 1, that is, when the lag is k = 1. Similarly, y_{t−2} refers to the observation at time t − 2, and so on.
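Equation (1), together with the matching autocorrelation defined next, can be computed directly from the definitions. The notes use Excel and R's acf function; the plain Python sketch below is only illustrative, and the five-point series in it is a toy example, not one of the chapter's data sets.

```python
def acov(y, k):
    """Sample autocovariance c_k as in Eq. (1): divide by n, sum from t = k+1."""
    n = len(y)
    ybar = sum(y) / n
    # 0-based indexing: t runs over k..n-1, pairing y[t] with y[t-k].
    return sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n)) / n

def acor(y, k):
    """Sample autocorrelation r_k: the lag-k cross sum over the total sum of squares."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n))
    den = sum((yt - ybar) ** 2 for yt in y)
    return num / den

y = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy series
print(acov(y, 0))  # c_0 = MSD = 2.0
print(acov(y, 1))  # c_1 = 0.8
print(acor(y, 1))  # r_1 = c_1 / c_0 = 0.4
```

Note that r_k equals c_k / c_0, since numerator and denominator of r_k are both scaled by the same 1/n; this is also why r_0 is always 1 and c_0 equals the MSD.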
We can also state the observations as follows: observation y_{t−1} is described as "lagged" by one period, y_{t−2} as lagged by two periods, and so on. The sample autocorrelation at lag k is given by

r_k = ∑_{t=k+1}^{n} (y_t − ȳ)(y_{t−k} − ȳ) / ∑_{t=1}^{n} (y_t − ȳ)²,    (2)

where r_k can be interpreted as the correlation between y_t and y_{t−k}. If k = 1, then the sample autocovariance and autocorrelation at lag one are respectively given by

c_1 = (1/n) ∑_{t=2}^{n} (y_t − ȳ)(y_{t−1} − ȳ),

r_1 = ∑_{t=2}^{n} (y_t − ȳ)(y_{t−1} − ȳ) / ∑_{t=1}^{n} (y_t − ȳ)².

This autocorrelation tells us whether there is a correlation between y_t and y_{t−1}. Consider the Excel file Ch7 - Examples.xlsx for an example of how to calculate the autocovariance and autocorrelation in Excel (for the beer data). If k = 2, then the sample autocovariance and autocorrelation at lag two are given by

c_2 = (1/n) ∑_{t=3}^{n} (y_t − ȳ)(y_{t−2} − ȳ),

r_2 = ∑_{t=3}^{n} (y_t − ȳ)(y_{t−2} − ȳ) / ∑_{t=1}^{n} (y_t − ȳ)².

Often, we calculate these autocorrelations for several lags. Together, the autocorrelations r_1, r_2, ..., r_K form the autocorrelation function (ACF). A plot of these autocorrelations is referred to as a correlogram or an ACF plot. We can easily obtain the autocovariances, the autocorrelations and their respective plots (the autocovariance function and the autocorrelation function) using R. Below we show an example with the beer data. Note that when k = 0, the autocovariance is equal to the MSD, and the autocorrelation is equal to 1.

A quick calculation of autocovariances and autocorrelations in R

out_c = acf(beer$Production, lag.max = 6, type = "covariance")
out_c

Autocovariances of series 'beer$Production', by lag
     0      1      2      3      4      5      6
377.43 158.81  21.36 -22.34 -70.91 -108.43 -159.93

out_r = acf(beer$Production, lag.max = 6, type = "correlation")
out_r

Autocorrelations of series 'beer$Production', by lag
    0     1     2     3     4     5     6
1.000 0.421 0.057 -0.059 -0.188 -0.287 -0.424

We can also easily construct the ACF plot in R. An example with the beer data is given below for 20 lags.
The resulting plot is shown in Figure 7.

Constructing the ACF in R

acf(beer$Production, lag.max = 20, type = "covariance")

Figure 7: The autocovariance function (ACF) for the beer data.

ACF plots can be used to assist in identifying seasonal patterns. For example, in Figure 7 we note that annual beer production spikes around June and December, which shows up as large correlations at lags k = 6 and k = 12.

Example 3
Calculate the autocorrelation and autocovariance up to a lag of k = 2 for the oil price data set. Do the calculations in Excel and in R.

4.1.6 Forecasting methods

Various techniques exist that are used to make forecasts in time series analysis. In this note set we will only cover a few basic tools (in Statistics 348 forecasting is dealt with in much more detail). The most basic forecasting method is the so-called Naïve Forecast 1 (or NF1) method. This method uses the most recent observation as a forecast. For example, yesterday's value is used as the forecast for today, and today's value is used as the forecast for tomorrow.

We can also use the mean to make forecasts. For example, with the beer data, the forecast of beer production for the remaining months of 1995 is the average over the previous years for each month; the prediction for September 1995 is equal to (138 + 138 + 143 + 143)/4 = 140.5. One can also calculate the mean of all historical data and use it as the forecast.

A more sophisticated method is the moving average. A moving average does exactly what its name suggests: it involves "moving" averages to adjust the data and create a forecast. The purpose of this method is two-fold. Firstly, it can be used to smooth out short-term variations in the data. Moving averages are often used to determine the trend direction and can be calculated in various ways, including simple and weighted moving averages (time series decomposition is discussed in Statistics 348).
Secondly, it can be used to forecast: the moving average takes the average of only the latest k observations in the time series, and this average is then used as the forecast itself. Such a forecast works well if the data are stationary around a constant. We can write the moving-average forecast as

ŷ_{t+1} = (1/k) ∑_{i=t−k+1}^{t} y_i = (1/k)(y_t + y_{t−1} + ... + y_{t−k+1}).

Note that the forecast is a function of the k past observations, each with the same weight. Comparing the simple mean and the moving average:

1. The moving average only deals with k observations, while the simple mean works with all observations.
2. Both methods can work well for a stationary time series, but cannot handle trend, cyclical, or seasonal patterns.
3. If k is equal to the sample size n, the moving average of order n (or MA(n)) becomes the simple mean.
4. If k = 1, the moving average of order 1 (or MA(1)) becomes the NF1 method.

Thus, we should use 1 < k < n. Table 5 shows an example where the simple mean and the moving average are used to forecast a future value (January 2024) for the oil price data. Note that we denote this forecast by ŷ_{t+1}.

Date        Average_Price  Simple mean  MA(3)  MA(5)
:           :
2023/08/01  83.75
2023/09/01  85.5
2023/10/01  84.0
2023/11/01  82.3
2023/12/01  80.5
2024/01/01                 74.2         82.27  83.21

Table 5: Forecasting with the oil price data.

Example 4
Confirm the forecasts for the oil price data in Table 5.

Example 5
Use the stock price data to make a forecast for 1 April 2024 based on the moving average of order 5.

4.1.7 Measuring forecasting accuracy

The following is a list of standard statistical measures used in time series analysis to assess the performance of a forecasting model.
mean error: ME = (1/n) ∑_{t=1}^{n} (y_t − ŷ_t)

mean absolute error: MAE = (1/n) ∑_{t=1}^{n} |y_t − ŷ_t|

mean square error: MSE = (1/n) ∑_{t=1}^{n} (y_t − ŷ_t)²

percentage error: PE_t = ((y_t − ŷ_t)/y_t) × 100

mean percentage error: MPE = (1/n) ∑_{t=1}^{n} PE_t

mean absolute percentage error: MAPE = (1/n) ∑_{t=1}^{n} |PE_t|

In these measures, y_t refers to the time series observation and ŷ_t to the forecast at time t. Thus, y_{t+1} and ŷ_{t+1} refer to the corresponding values at time t + 1 (one period into the future). Consider the example below, based on the 1995 beer data, that shows the calculations for the ME, MAE and MSE.

t  y_t  ŷ_t     y_t − ŷ_t  |y_t − ŷ_t|  (y_t − ŷ_t)²
1  138  150.25  −12.25     12.25        150.06
2  136  139.5    −3.5       3.5          12.25
3  152  157.25   −5.25      5.25         27.56
4  127  143.5   −16.5      16.5         272.25
5  151  138      13        13           169
6  130  127.5     2.5       2.5           6.25
7  119  138.25  −19.25     19.25        370.56
8  153  141.5    11.5      11.5         132.25
                ME=−3.72   MAE=10.47    MSE=142.52

Table 6: Calculating measures to assess forecasting accuracy (ME, MAE, MSE).

t  y_t  ŷ_t     y_t − ŷ_t  PE_t (%)  |PE_t| (%)
1  138  150.25  −12.25      −8.9       8.9
2  136  139.5    −3.5       −2.6       2.6
3  152  157.25   −5.25      −3.5       3.5
4  127  143.5   −16.5      −13        13
5  151  138      13          8.6       8.6
6  130  127.5     2.5        1.9       1.9
7  119  138.25  −19.25     −16.2      16.2
8  153  141.5    11.5        7.5       7.5
                           MPE=−3.3%  MAPE=7.8%

Table 7: Calculating measures to assess forecasting accuracy (MPE, MAPE).

The summary statistics discussed so far measure the goodness-of-fit of the model on historical data (the training data). The drawback of measures based on the training data is that they tend to be biased and will show a good fit. To overcome this problem, we often split the available data into a "training set" and a "test set". The model is fitted on the training set and the goodness-of-fit measures are calculated on the test set (training of models is discussed in more detail in Statistics 244).

Example 6
Use the following predictions for the oil price data, and calculate the ME, MAE, MSE, MPE, and the MAPE. Based on these measures, which forecasting method is preferred?
t   y_t   MA(3)  MA(5)
21  85.5  81.28  79.3
22  84    83.42  81.32
23  82.3  84.42  82.67
24  80.5  83.93  83.31

Theil's U-statistic

The next goodness-of-fit statistic is Theil's U-statistic. This statistic measures the accuracy of a forecasting model by comparing it to the NF1 method. The model mentioned here can be any model (e.g., a moving average). The statistic is given by

U = sqrt[ ∑_{t=1}^{n−1} (FPE_{t+1} − APE_{t+1})² / ∑_{t=1}^{n−1} (APE_{t+1})² ],    (3)

where FPE_{t+1} = (ŷ_{t+1} − y_t)/y_t is the relative change of the forecast, and APE_{t+1} = (y_{t+1} − y_t)/y_t is the relative change of the actual observation. We can therefore write

U = sqrt[ ∑_{t=1}^{n−1} ((ŷ_{t+1} − y_{t+1})/y_t)² / ∑_{t=1}^{n−1} ((y_{t+1} − y_t)/y_t)² ].    (4)

We will use Eq. (4) to calculate the U-statistic. It has the following interpretation:

U = 1: the NF1 method is as good as the method being evaluated.
U < 1: the method being evaluated is better than the NF1 method.
U > 1: the NF1 method is better than the method being evaluated, so there is no point in using the evaluated method.

t      y_t  ŷ_t     Numerator  Denominator
1      138  150.25  0.0006     0.0002
2      136  139.5   0.0015     0.0138
3      152  157.25  0.0118     0.0271
4      127  143.5   0.0105     0.0357
5      151  138     0.0003     0.0193
6      130  127.5   0.0219     0.0072
7      119  138.25  0.0093     0.0816
8      153  141.5   –          –
Total               0.0560     0.1850

Table 8: Calculating the U-statistic: U = sqrt(0.0560/0.1850) = 0.550.

In Table 8, Theil's U-statistic for the beer data equals 0.550; since U < 1, the forecasting model performs better than the NF1 method. Also refer to the Ch7 - Examples.xlsx Excel file.

Example 7
Calculate Theil's U-statistic for the set of predictions in Example 6. How do the two forecasting methods compare to the NF1 method?

5 Conclusion

In this chapter we introduced various concepts in time series analysis. We looked at examples of different types of time series, and compared them to cross-sectional data.
In time series analysis it is important to be able to distinguish between horizontal, seasonal, cyclical, trend and random patterns. Seasonal patterns can also be analysed with seasonal plots. The autocovariance and autocorrelation statistics are important for understanding the correlation between observations at consecutive times. The moving average was discussed as a basic forecasting tool, and we looked at various metrics to analyse the performance of forecasts. Finally, Theil's U-statistic can be used to see how a forecasting method compares to the NF1 method.

6 Solutions

Example 1

Figure 8: Time series plot of the stock price data. Price is in ZAR.

Example 2

The following graphs were obtained in R. Students may also produce the graphs in Excel.

Histogram of age

hist(customer_data$Age)

Barplot for education level

barplot(table(customer_data$Education_Level), xlab="Education level", ylab="Frequency")

Barplot for income level

customer_data$Income_Level = factor(customer_data$Income_Level, levels = c("Low", "Medium", "High"))
barplot(table(customer_data$Income_Level), xlab="Income level", ylab="Frequency")

Histogram for monthly spend

hist(customer_data$Monthly_Spend)

Histogram for number of purchases

hist(customer_data$Number_of_Purchases)

Barplot of satisfaction level

barplot(table(customer_data$Satisfaction_Level), xlab="Satisfaction level", ylab="Frequency")

Example 3

The autocovariances are equal to c_0 = 36.6, c_1 = 31.7, and c_2 = 25.3. The autocorrelations are equal to r_0 = 1, r_1 = 0.868, and r_2 = 0.692. The R code is shown below. For the Excel calculations, refer to the file Ch7 - Class Example - Solutions.

Autocovariance and autocorrelation up to lag 2

(exer3 = acf(oil_prices$Average_Price, type = "covariance", lag.max = 2))

    0    1    2
 36.6 31.7 25.3

(exer3 = acf(oil_prices$Average_Price, type = "correlation", lag.max = 2))

    0     1     2
1.000 0.868 0.692

Example 4

For the Excel calculations, refer to the file Ch7 - Class Example - Solutions.
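The Example 4 forecasts can also be cross-checked with a short script. The chapter's official workings are in Excel; the Python sketch below is only an illustrative check, the five prices are the last observed values from Table 5, and ma_forecast is a hypothetical helper name (the simple-mean column needs the full 2022-2023 history, which the table preview does not show, so it is not checked here).

```python
# Last five observed oil prices from Table 5 (Aug-Dec 2023).
last5 = [83.75, 85.5, 84.0, 82.3, 80.5]

def ma_forecast(y, k):
    """MA(k) forecast for the next period: the mean of the last k observations."""
    return sum(y[-k:]) / k

print(round(ma_forecast(last5, 3), 2))  # 82.27, as in Table 5
print(round(ma_forecast(last5, 5), 2))  # 83.21, as in Table 5
```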
Example 5

ŷ_{t+1} = (59.09... + 58.89... + 58.90... + 58.62... + 58.60...)/5 = 58.82457, where t = 456.

Example 6

For the Excel calculations, refer to the file Ch7 - Class Example - Solutions. If we use the ME, MAE or the MSE, then the MA(3) predictions are preferred, but if we use the MPE or the MAPE, then the MA(5) predictions are preferred.

Example 7

For the Excel calculations, refer to the file Ch7 - Class Example - Solutions. Both methods produced U-statistics greater than 1 (1.42 and 1.35 for MA(3) and MA(5), respectively). Therefore, both methods are worse than the NF1 method.

References

Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and Applications (3rd ed.). John Wiley & Sons.