Lecture Notes: Time Series Analysis
Alexander Lindner
Ulm University
Winter semester 2024/25

Contents

Foreword
1 Introduction
2 Stationary time series and the autocorrelation function
3 Data cleansing from trends and seasonal effects
  3.1 Decomposition of time series
  3.2 Estimation of the trend in the absence of seasonality
    3.2.1 Linear regression
    3.2.2 Polynomial or more general regression
    3.2.3 Moving average smoothing
    3.2.4 Exponential smoothing
    3.2.5 Differencing
    3.2.6 Transformation of the data
  3.3 Estimation and elimination of seasonality in the absence of trend
    3.3.1 Average method
    3.3.2 Harmonic regression
    3.3.3 Season elimination by differencing of higher order
  3.4 Estimating trend and season simultaneously
4 Properties of the autocovariance function
5 Linear filters and two-sided moving average processes of infinite order
6 ARMA processes
  6.1 Definition, AR(1) and ARMA(1,1) process
  6.2 Stationary solutions, causal and invertible ARMA processes
  6.3 Homogeneous linear difference equations with constant coefficients
  6.4 Three methods to compute the autocovariance function of a causal ARMA process
    6.4.1 First method: calculation of the MA(∞) representation
    6.4.2 Second method: finding a recursion for the autocovariance function
    6.4.3 Third method: the autocovariance generating function
7 Linear prediction
  7.1 Hilbert spaces
  7.2 Best prediction and best linear prediction
  7.3 Recursive calculation of one-step predictors
  7.4 The partial autocorrelation function and how to detect AR or MA processes visually
8 Estimation of the mean value
9 Estimation of the autocovariance function
10 Yule-Walker estimator for causal AR(p) processes
11 Further estimators and order selection
  11.1 The least squares estimator
  11.2 The (quasi-)maximum-likelihood estimator
  11.3 Order selection

Bibliography

[BD1] Brockwell, P.J. and Davis, R.A. (1990): Time Series: Theory and Methods. 2nd edition, Springer.
[BD2] Brockwell, P.J. and Davis, R.A. (2016): Introduction to Time Series and Forecasting. 3rd edition, Springer.
[Fu] Fuller, W.A. (1996): Introduction to Statistical Time Series. 2nd edition, Wiley.
[Ha] Hamilton, J.D. (1994): Time Series Analysis. Princeton University Press.
[KN] Kreiß, J.-P. and Neuhaus, G. (2006): Einführung in die Zeitreihenanalyse. Springer.

Foreword

Dear students,

these are the lecture notes for the course on Time Series Analysis.

For whom is this course?

This course is (mainly) for master students in mathematics, business mathematics, mathematical biometry, finance or CSE. Other students are welcome, and if their Study and Examination Regulations allow them to take the exam, they may do so.

Lectures

We hope that this year we can give the lecture in person throughout the semester. I will point out the relevant material and partially refer to the lecture notes for further reading.
Not everything in the lecture notes is relevant; we will make a selection during the lecture.

Exercises

Exercise sheets will be handed out fortnightly (i.e. every second week) in Moodle. You do not have to hand them in, and they are not a prerequisite for the exam. The exercises are organised by Lorenzo Proietti. More information regarding the exercises will be posted separately.

Exam

We will decide during the first lecture whether we will have an oral or a written exam at the end of the semester.

Literature

The lecture notes are based on the book [BD1] by Brockwell and Davis. This is really an excellent book and I will mostly follow it. As further literature, I recommend the following books:

Brockwell, P.J. and Davis, R.A. [BD2]: This book covers the practical aspects of time series analysis more than the other book by Brockwell and Davis.

Fuller, W.A. [Fu]: A very good book, which however concentrates more on the statistical aspects of time series.

Hamilton, J.D. [Ha]: Time series analysis. Princeton University Press. (A very good book, but a big one.)

Kreiß, J.-P. and Neuhaus, G. [KN]: This book is also excellent in my opinion and close to what I will be doing in this lecture. Unfortunately, so far it is only available in German, although the authors are working on an English version.

Alexander Lindner

Chapter 1
Introduction

This is an introductory chapter. We address the question of what a time series is and present some examples. There are two possible definitions of a time series. One is that it is a collection of data in time; the other is that it is a certain stochastic process which serves as a model for these data. From the practical point of view, the first definition is the relevant one; from the theoretical point of view, the second. Let us first give the practical definition.

Definition 1.1. A time series is a sequence of observations $x_1, \dots, x_T$, or $(x_n)_{n \in \mathbb{N}}$, or $(x_n)_{n \in \mathbb{Z}}$, recorded at specific time points $1, 2, \dots$. When the observations are one-dimensional, we speak of a univariate time series; when they are multi-dimensional, we speak of a multivariate time series.

Example 1.2. Denote by $x_t$ the monthly sales of Australian red wine in kilolitres, taken from ITSM (a programme accompanying the book [BD2] by Brockwell and Davis), starting from January 1980 ($t = 1$) over February 1980 ($t = 2$) to October 1991 ($t = 142$). The graph is displayed in Figure 1.1. When looking at the data, we see the following features:

(i) There seems to be an upward trend, i.e. on average people consume more red wine as time moves on.

(ii) There is a seasonal pattern, namely most red wine is drunk in July, while not so much in January. This is because red wine is more often drunk in winter than in summer, and in Australia, January is summer while July is winter.

(iii) There is an increase in variability, meaning that the fluctuations become larger and larger with increasing time.

So what does one usually do first when one has a time series and is interested in its structure? The first thing is to plot the data. Then one can often already see whether

- there is a trend over time, i.e. whether the data increase or decrease in time, and if so, how they do that,
- there is a seasonal pattern, or some cyclic pattern,
- the variability is constant over time or varies,
- there are other systematic features present in the data.

[Figure 1.1: Monthly sales of Australian red wine, Jan. 1980 – Oct. 1991, as discussed in Example 1.2. Axes: time (1980–1992) against monthly sales in kilolitres.]
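As suggested above, the very first step with a new series is simply to plot it against time. The following R sketch shows the idea for a monthly series such as the wine data of Example 1.2; since those data ship with ITSM rather than with R, the file name "redwine.csv" and its column "sales" are purely hypothetical placeholders.

    # Minimal sketch: turn a monthly series into a ts object and plot it (cf. Example 1.2).
    # "redwine.csv" and the column "sales" are hypothetical placeholders.
    wine <- read.csv("redwine.csv")
    x <- ts(wine$sales, start = c(1980, 1), frequency = 12)  # monthly data from January 1980
    plot(x, xlab = "time", ylab = "monthly sales (kilolitres)")

The frequency argument (12 observations per year) is what later lets R's seasonal tools recognise the monthly structure.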
In the third chapter we will be concerned with identifying trend or seasonality. In this chapter, we look at some specific examples.

Example 1.3. Figure 1.2 shows the monthly total airline passenger numbers (in thousands) from January 1949 to December 1960, taken from ITSM. They seem to exhibit a linear upward trend and a seasonal component, as well as an increase in variability.

[Figure 1.2: International airline passenger data (in thousands) from January 1949 – December 1960. Axes: time against the number of passengers.]

Example 1.4. Figure 1.3 shows the average monthly temperature (in Fahrenheit) at Dubuque, Iowa (USA), in the time period 1964 – 1976. The data are taken from the R library TSA, data(tempdub). The data seem to exhibit a seasonal component, but no trend.

[Figure 1.3: Monthly temperature at Dubuque, Iowa, from 1964 – 1976. Axes: time against temperature.]

Example 1.5. Figure 1.4 shows the Dow Jones Utilities Index from Aug. 28 – Dec. 18, 1972 (daily data), taken from ITSM. Looking at it, there seems to be an upward trend, but no seasonal component. It is a typical financial time series. A financial time series is often much better analysed using the differenced series or the log returns.

[Figure 1.4: Dow Jones Utilities Index from Aug. 28 – Dec. 18, 1972 (daily). Axes: days from August 28 to December 18, 1972 against the index level.]

Example 1.6. Figure 1.5 displays the daily log returns (multiplied by 100) based on the closing prices $P(t)$ of 7 stock indices. The log return is defined as
$$\log \frac{P(t)}{P(t-1)} = \log P(t) - \log P(t-1);$$
here, $\log$ denotes the natural logarithm. The first return is for July 2, 1997, the last is for April 9, 1999. The indices are: Australian All Ordinaries, Dow Jones Industrial, Hang Seng, JSI (Indonesia), KLSE (Malaysia), Nikkei 225, KOSPI (South Korea); the data are taken from ITSM. No trend or seasonal pattern is visible. We will hardly analyse multivariate data in this course, but of course there may be connections between the individual time series considered here.
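Log returns as in Example 1.6 are straightforward to compute once a price series is available. A small sketch, with an artificial price series standing in for the actual index closing prices (which come from ITSM and are not bundled with R):

    # Sketch: daily log returns (times 100) as in Example 1.6.
    # P is an artificial placeholder price series; substitute real closing prices.
    set.seed(1)
    P <- 100 * cumprod(exp(rnorm(450, sd = 0.01)))   # 450 artificial "closing prices"
    r <- 100 * diff(log(P))                          # log P(t) - log P(t-1), times 100
    plot(r, type = "l", xlab = "day", ylab = "log return (x 100)")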
Example 1.7. Further examples of time series include e.g.

- monthly unemployment rates in Germany,
- annual low water levels of the river Nile,
- the Canadian lynx data: the number of lynx captured at the MacKenzie River from 1821 to 1934,
- stock prices, exchange rates, log returns,
- sunspot numbers (average number per year),
- accident data, diseases, clinical trials, etc.

So, we have real data given. But how do probability and statistics come in? The idea is that one models a phenomenon as a stochastic process, and then views the given time series as one realisation of this stochastic process. Let us recall what a stochastic process (with a time domain specified below) is.

Definition 1.8. A real-valued or $\mathbb{R}^d$-valued stochastic process with index set $\mathbb{T} = \mathbb{N}$, $\mathbb{T} = \mathbb{N}_0$ or $\mathbb{T} = \mathbb{Z}$ is a sequence $(X_t)_{t \in \mathbb{T}}$ of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. We call $(X_t)_{t \in \mathbb{T}}$ a time series. The functions $\mathbb{T} \ni t \mapsto X_t(\omega)$ for $\omega \in \Omega$ are called paths or realizations of the time series. In reality, one usually observes only one path.

So this is the second, theoretical definition of a time series, namely simply as a stochastic process $(X_t)_{t \in \mathbb{T}}$. As said, for most of our analysis in these notes we will work with the second definition, with the understanding that the real data time series is a realisation $x_t = X_t(\omega)$ for all $t$ and some fixed $\omega \in \Omega$, so basically we see one path of the stochastic process. The interesting feature in time series analysis is that the stochastic process $(X_t)_{t \in \mathbb{T}}$ does not come from an i.i.d. sequence, but that there are dependencies between the different $X_t$.

Task 1.9. What are the objectives of time series analysis?

(i) The first task is to find a model type for the observed data $(x_t)$. This is usually done using a stochastic process, and one assumes that one has exactly one realisation of it.

(ii) Then one is interested in identifying the "obvious" patterns, like trend, seasonality and other deterministic quantities. Although we will not work much with those, this is actually one of the most important steps and often has the most influence. We shall treat this briefly in the third chapter.

(iii) Having estimated trend, seasonality and other deterministic quantities (call their sum $y_t$), one subtracts these from the data and is left with the "residuals" $z_t = x_t - y_t$. No obvious structure should be present in these any more.

(iv) Then one fits a stochastic model to the remaining residuals $(z_t)$. For that, one needs to know a variety of model classes.

(v) Having done this, one should check the model for goodness of fit. It may well be that the fitted model class is not good at all; if so, one has to start again and fit another model class.

(vi) Supposing the model is good, one can use it to forecast (i.e. predict) future values. This is of course the issue one is most interested in: prediction.

(vii) If the model is good, one can also use it to test certain hypotheses.

These are the tasks we try to carry out (at least partially). In the next chapter we shall treat the notion of stationarity. This is the property we usually impose on the residuals.

[Figure 1.5: Daily log returns of 7 indices as described in Example 1.6. Panels: Australia, Dow-Jones Industrials, Hang Seng, Indonesia, Malaysia, Nikkei 225, South Korea; horizontal axes: trading days 0–450.]

Chapter 2
Stationary time series and the autocorrelation function

In this chapter we will learn what a stationary time series is. The idea is that one usually estimates or eliminates trend and seasonality in the data first, and then tries to model the residuals with a stationary model. So what does stationarity mean? There are two notions of stationarity, namely strict stationarity and weak (or second order) stationarity. Suppose throughout that $\mathbb{T} = \mathbb{N}$, $\mathbb{T} = \mathbb{N}_0$ or $\mathbb{T} = \mathbb{Z}$. For simplicity, we assume that our time series $(X_t)_{t \in \mathbb{T}}$ is real valued (or complex valued, later), and we think of a time series as a stochastic process.
Definition 2.1. A time series $(X_t)_{t \in \mathbb{T}}$ is said to be strictly stationary if for all $t_1, \dots, t_n \in \mathbb{T}$ and $k \in \mathbb{N}$ it holds that
$$(X_{t_1}, X_{t_2}, \dots, X_{t_n}) \overset{d}{=} (X_{t_1+k}, X_{t_2+k}, \dots, X_{t_n+k}),$$
i.e. the finite dimensional distributions are shift invariant.

So not only does each $X_t$ have the same distribution at every time point, but also the distribution of $(X_1, X_2)$ is the same as the distribution of $(X_5, X_6)$, so the dependency between $X_1$ and $X_2$ is the same as the dependency between $X_5$ and $X_6$. In some sense, one is in an equilibrium, and this is called strict stationarity.

Before we define weak stationarity, let us define the mean function and the covariance function of a time series.

Definition 2.2. Let $X = (X_t)_{t \in \mathbb{T}}$ be a time series with $E X_t^2 < \infty$ for all $t \in \mathbb{T}$. Then
$$\mu_X(t) := E X_t, \quad t \in \mathbb{T},$$
is called the mean function and
$$\gamma_X(r,s) := \operatorname{Cov}(X_r, X_s) = E[(X_r - \mu_X(r))(X_s - \mu_X(s))], \quad r, s \in \mathbb{T},$$
the covariance function of $X$.

The notion of weak stationarity is indeed weaker (at least if we have finite variance):

Definition 2.3. A real valued time series $(X_t)_{t \in \mathbb{T}}$ is said to be weakly stationary or second order stationary or simply stationary, if

i) $E X_t^2 < \infty$ for all $t \in \mathbb{T}$,

ii) $E X_t = E X_{t'}$ for all $t, t' \in \mathbb{T}$ (i.e. $\mu_X$ is constant),

iii) $\gamma_X(t+h, t) = \gamma_X(t'+h, t')$ for all $t, t' \in \mathbb{T}$, $h \in \mathbb{N}_0$, i.e. $\gamma_X(r,s)$ depends only on $r - s$.

Then $\mu_X := E X_t$ is called the mean of $X$, and
$$\gamma_X(h) := \gamma_X(t+h, t), \quad h \in \mathbb{N}_0 \ (\text{if } \mathbb{T} = \mathbb{N}_0 \text{ or } \mathbb{T} = \mathbb{N}) \text{ or } h \in \mathbb{Z} \ (\text{if } \mathbb{T} = \mathbb{Z}),$$
is called the autocovariance of $X$ at lag $h$. The function/sequence $(\gamma_X(h))_{h \in \mathbb{N}_0/\mathbb{Z}}$ is called the autocovariance function of $X$ (ACVF). The autocorrelation of $X$ at lag $h$ is defined by (if $\gamma_X(0) \neq 0$)
$$\varrho_X(h) := \frac{\gamma_X(h)}{\gamma_X(0)} = \operatorname{Corr}(X_{t+h}, X_t), \quad h \in \mathbb{N}_0/\mathbb{Z},$$
and $(\varrho_X(h))_{h \in \mathbb{N}_0/\mathbb{Z}}$ is called the autocorrelation function of $X$ (ACF).

So while strict stationarity is defined in terms of the finite dimensional distributions, weak stationarity is only defined in terms of the mean and the covariances. We have:

Proposition 2.4. Let $(X_t)_{t \in \mathbb{T}}$ be strictly stationary.

a) For all $t, t' \in \mathbb{T}$ it holds true that $X_t \overset{d}{=} X_{t'}$.

b) If $E X_t^2 < \infty$ for all $t \in \mathbb{T}$, then $(X_t)_{t \in \mathbb{T}}$ is weakly stationary.

Proof. a) clear, b) exercise/clear.

The easiest example of a strictly stationary sequence is i.i.d. noise:

Example 2.5. [IID noise] If $X = (X_t)_{t \in \mathbb{T}}$ is independent and identically distributed (i.i.d.), then $X$ is called i.i.d. noise or i.i.d. white noise. Denote by $\rho$ the distribution of $X_1$. By the i.i.d. property, the distribution of $(X_{t_1}, \dots, X_{t_n})$ for $t_1 < \dots < t_n$ is then given by $\rho^{\otimes n}$, the $n$-fold product measure of $\rho$ with itself. From this we see immediately that $(X_t)_{t \in \mathbb{T}}$ is strictly stationary. If $E X_t = 0$ and $E X_t^2 = \sigma^2 < \infty$, we write $(X_t)_{t \in \mathbb{T}} \sim \mathrm{IID}(0, \sigma^2)$.

i.i.d. noise is the most basic building block for time series when thinking of strict stationarity. When thinking of weak stationarity, the most basic building block is white noise:

Example 2.6. [White noise] If $E X_t^2 < \infty$, $E X_t = 0$, $\operatorname{Var} X_t = E X_t^2 = \sigma^2 \in (0, \infty)$ for all $t \in \mathbb{T}$ and $\operatorname{Cov}(X_t, X_{t'}) = 0$ for all $t \neq t'$, then $(X_t)_{t \in \mathbb{T}}$ is called white noise, written as $(X_t)_{t \in \mathbb{T}} \sim \mathrm{WN}(0, \sigma^2)$. This time series is weakly stationary and $\gamma_X$ is given by
$$\gamma_X(h) = \begin{cases} \sigma^2, & h = 0, \\ 0, & h \neq 0. \end{cases}$$

There are also examples of time series that are not stationary:

Example 2.7. [Random walk] Let $(Z_t)_{t \in \mathbb{N}}$ be $\mathrm{IID}(0, \sigma^2)$ with $\sigma > 0$ and define $S_t := \sum_{i=1}^{t} Z_i$, $t \in \mathbb{N}_0$. It follows that
$$E S_t = 0, \quad \operatorname{Var} S_t = \operatorname{Var} \sum_{i=1}^{t} Z_i = t\sigma^2 < \infty,$$
$$\gamma_S(t+h, t) = \operatorname{Cov}(S_{t+h}, S_t) = \operatorname{Cov}(S_t + Z_{t+1} + \dots + Z_{t+h}, S_t) = \operatorname{Cov}(S_t, S_t) = t\sigma^2,$$
so $(S_t)_{t \in \mathbb{N}_0}$ is not weakly stationary, hence not strictly stationary either.
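A quick simulation makes the non-stationarity of the random walk in Example 2.7 visible: across many independent paths, the empirical variance of $S_t$ grows roughly like $t\sigma^2$. The following R sketch (with illustrative values $\sigma = 1$, 100 time steps, 5000 paths) is only meant to picture this.

    # Sketch: the random walk of Example 2.7 has Var(S_t) = t * sigma^2, so it cannot be stationary.
    set.seed(42)
    sigma <- 1; n <- 100; npaths <- 5000
    Z <- matrix(rnorm(npaths * n, sd = sigma), nrow = npaths)  # i.i.d. noise, one row per path
    S <- t(apply(Z, 1, cumsum))                                # row i holds S_1, ..., S_n of path i
    emp_var <- apply(S, 2, var)                                # empirical variance at each time t
    plot(1:n, emp_var, xlab = "t", ylab = "empirical Var(S_t)")
    lines(1:n, (1:n) * sigma^2, lty = 2)                       # theoretical variance t * sigma^2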
The easiest example of a weakly stationary process built from white noise is the moving average process:

Example 2.8. [MA(q) process] Let $(Z_t)_{t \in \mathbb{Z}} \sim \mathrm{WN}(0, \sigma^2)$ and
$$X_t := Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}, \quad t \in \mathbb{Z},$$
with $\theta_1, \dots, \theta_q \in \mathbb{R}$. Then $(X_t)_{t \in \mathbb{Z}}$ is called an MA(q) process or moving average process of order $q$. It holds true that $E X_t^2 < \infty$, $E X_t = 0$, and with $\theta_0 := 1$,
$$\gamma_X(h) = \operatorname{Cov}\Bigl(\sum_{i=0}^{q} \theta_i Z_{t+h-i}, \sum_{j=0}^{q} \theta_j Z_{t-j}\Bigr) = \sum_{i,j=0}^{q} \theta_i \theta_j \operatorname{Cov}(Z_{t+h-i}, Z_{t-j}) = \sigma^2 \sum_{j=0}^{q-|h|} \theta_j \theta_{j+|h|}$$
(an empty sum, hence $0$, for $|h| > q$; see this first for $h \ge 0$, and for $h < 0$ use $\gamma_X(h) = \gamma_X(-h)$), so an MA(q) process is weakly stationary.

Remark 2.9. The converse of Proposition 2.4 b) does not hold true, i.e. there exists a weakly stationary time series which is not strictly stationary.

Proof. Exercise.
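The ACVF formula of Example 2.8 is easy to check numerically: R's ARMAacf returns the theoretical autocorrelations of an ARMA model, so after normalising by $\gamma_X(0)$ the two should coincide. The coefficients below ($\theta = (0.4, -0.3, 0.2)$, $\sigma^2 = 2$) are arbitrary illustrative values, not taken from the notes.

    # Sketch: gamma(h) = sigma^2 * sum_{j=0}^{q-|h|} theta_j * theta_{j+|h|} with theta_0 = 1,
    # checked against the autocorrelations returned by stats::ARMAacf.
    theta <- c(0.4, -0.3, 0.2)     # illustrative MA(3) coefficients
    sigma2 <- 2
    th <- c(1, theta); q <- length(theta)
    gamma <- sapply(0:q, function(h) sigma2 * sum(th[1:(q + 1 - h)] * th[(1 + h):(q + 1)]))
    rho <- gamma / gamma[1]
    all.equal(rho, unname(ARMAacf(ma = theta, lag.max = q)))   # should be TRUE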
Let us recall the definition of the normal distribution:

Definition 2.10. Let $\sigma \ge 0$ and $\mu \in \mathbb{R}$. A probability measure $\rho$ on $(\mathbb{R}, \mathcal{B}_1)$ is a normal distribution with mean $\mu$ and variance $\sigma^2$ if either $\sigma = 0$ and $\rho = \delta_\mu$, the Dirac measure at $\mu$, or $\sigma \in (0, \infty)$ and $\rho$ has probability density
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(x-\mu)^2}{2\sigma^2}\Bigr),$$
i.e.
$$\rho(A) = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(x-\mu)^2}{2\sigma^2}\Bigr)\, dx$$
for all $A \in \mathcal{B}_1$. We write $\rho = N(\mu, \sigma^2)$. A random variable $X : \Omega \to \mathbb{R}$ is called normally distributed or Gaussian distributed or a Gaussian random variable if its distribution is $N(\mu, \sigma^2)$ for some $\sigma \in [0, \infty)$ and $\mu \in \mathbb{R}$. When $\mu = 0$ and $\sigma = 1$ we speak of $N(0,1)$ as the standard normal distribution.

Remark 2.11. If $X \overset{d}{=} N(\mu, \sigma^2)$, then $X$ has indeed expectation $\mu$ and variance $\sigma^2$. Also, if $\sigma^2 = 0$, then $X$ is equal to $\mu$ almost surely.

Recall that the characteristic function $\varphi_X : \mathbb{R}^d \to \mathbb{C}$ of a random vector $X : \Omega \to \mathbb{R}^d$ is given by
$$\varphi_X(u) := E e^{i\langle u, X\rangle} = E e^{iu'X}, \quad u \in \mathbb{R}^d,$$
where $\langle \cdot, \cdot \rangle$ denotes the Euclidean scalar product in $\mathbb{R}^d$. We write $a'$ to denote the transpose of a vector or matrix $a$. Usually, vectors in $\mathbb{R}^d$ will be column vectors. The characteristic function uniquely determines the distribution of a random vector. Normal random variables can be characterised as follows:

Proposition 2.12. Let $X : \Omega \to \mathbb{R}$ be a random variable and $\mu \in \mathbb{R}$, $\sigma \ge 0$. Then $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$ if and only if the characteristic function $\varphi_X$ of $X$ is given by
$$\varphi_X(u) = e^{iu\mu - \sigma^2 u^2/2} \quad \forall\, u \in \mathbb{R}.$$

The proof can be found in any standard text on probability. The definition of a multivariate normal random variable can be deduced from the one-dimensional setting.

Definition 2.13. A random vector $X = (X_1, \dots, X_d)' : \Omega \to \mathbb{R}^d$ is Gaussian or normally distributed if for each $a = (a_1, \dots, a_d)' \in \mathbb{R}^d$ the linear combination $a'X = \sum_{j=1}^d a_j X_j$ is one-dimensional Gaussian distributed.

This may be a bit different from the definition of multivariate normal distributions you know. The next theorem gives the relation to the familiar concept.

Theorem 2.14. (a) If a random vector $X = (X_1, \dots, X_d)'$ is Gaussian, then each of its components has finite variance and the distribution of $X$ is uniquely determined by the mean $\mu := E(X)$ and the covariance matrix $\Sigma := \operatorname{Cov}(X) = (\operatorname{Cov}(X_i, X_j))_{i,j=1,\dots,d} \in \mathbb{R}^{d \times d}$. We write $X \sim N(\mu, \Sigma)$ or $X \overset{d}{=} N(\mu, \Sigma)$ in that case.

(b) To every $\mu \in \mathbb{R}^d$ and every symmetric positive semidefinite matrix $\Sigma \in \mathbb{R}^{d \times d}$ (i.e. $\Sigma' = \Sigma$ and $x'\Sigma x \ge 0$ for all $x \in \mathbb{R}^d$) there exists a random vector $X$ with distribution $N(\mu, \Sigma)$.

(c) A random vector $X : \Omega \to \mathbb{R}^d$ is $N(\mu, \Sigma)$-distributed (where $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$) if and only if the characteristic function of $X$ is given by
$$\varphi_X(u) = e^{i\langle u, \mu\rangle - u'\Sigma u/2} = e^{i\mu'u - u'\Sigma u/2} \quad \forall\, u \in \mathbb{R}^d.$$

(d) If $X \sim N(\mu, \Sigma)$ and the matrix $\Sigma$ is invertible, then $X$ has a probability density, which is given by
$$f_X(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\Bigl(-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\Bigr), \quad x \in \mathbb{R}^d.$$

(e) For a random vector $X$, $\mu \in \mathbb{R}^d$ and a positive semidefinite matrix $\Sigma \in \mathbb{R}^{d \times d}$ we have $X \sim N(\mu, \Sigma)$ if and only if $a'X \sim N(a'\mu, a'\Sigma a)$ for all $a \in \mathbb{R}^d$.

A proof can be found in the book by Brockwell and Davis [BD1], Section 1.6. Observe that the expectation of the vector $X$ is defined componentwise, i.e. $E(X_1, \dots, X_d)' = (EX_1, \dots, EX_d)'$.

Knowing the multivariate normal distribution, we can define what a Gaussian time series is.

Definition 2.15. A time series $(X_t)_{t \in \mathbb{Z}}$ is said to be a Gaussian time series if $(X_{t_1}, \dots, X_{t_n})'$ is Gaussian for all $n \in \mathbb{N}$ and $t_1, \dots, t_n \in \mathbb{Z}$, i.e. all finite-dimensional distributions are normally distributed.

For Gaussian time series, the concepts of weak and strict stationarity coincide:

Proposition 2.16. Let $X = (X_t)_{t \in \mathbb{Z}}$ be a Gaussian time series. Then the following statements are equivalent:

(i) $X$ is weakly stationary,

(ii) $X$ is strictly stationary.

Proof. "(ii) ⟹ (i)" follows from Proposition 2.4 (b), since $E X_t^2 < \infty$ for all $t$. Let us prove "(i) ⟹ (ii)". For all $n \in \mathbb{N}$ and $t_1, \dots, t_n \in \mathbb{Z}$ it holds true that
$$(X_{t_1+h}, \dots, X_{t_n+h})' \overset{d}{=} N\bigl((EX_{t_1+h}, \dots, EX_{t_n+h})', (\operatorname{Cov}(X_{t_i+h}, X_{t_j+h}))_{i,j=1,\dots,n}\bigr)$$
$$\overset{\text{(i)}}{=} N\bigl((\mu, \dots, \mu)', (\gamma_X(t_i + h - t_j - h))_{i,j=1,\dots,n}\bigr) = N\bigl((\mu, \dots, \mu)', (\gamma_X(t_i - t_j))_{i,j=1,\dots,n}\bigr)$$
$$\overset{\text{(i)}}{=} N\bigl((EX_{t_1}, \dots, EX_{t_n})', (\operatorname{Cov}(X_{t_i}, X_{t_j}))_{i,j=1,\dots,n}\bigr) \overset{d}{=} (X_{t_1}, \dots, X_{t_n})'.$$

Proposition 2.16 is the reason why one often considers only weakly stationary time series. The idea is that the time series arising in reality are often (approximately) Gaussian by virtue of the central limit theorem, and for Gaussian time series weak and strict stationarity are the same concept.

When we have empirical data, which we assume to be a realisation of a weakly stationary time series, we would like to estimate the mean and the autocovariance function. This is usually done with the following estimators:

Definition 2.17. Let $x_1, \dots, x_n$ be a realisation of an $\mathbb{R}$-valued time series (regardless of whether it is weakly stationary or not). Then the empirical mean of $x_1, \dots, x_n$ is defined by
$$\bar{x} := \frac{1}{n} \sum_{t=1}^{n} x_t.$$
The empirical autocovariance function $\hat{\gamma}$ is defined by
$$\hat{\gamma}(h) := \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x}), \quad h = 0, 1, \dots, n-1, \tag{2.1}$$
and
$$\hat{\gamma}(h) := \hat{\gamma}(-h), \quad h = -n+1, \dots, -1. \tag{2.2}$$
Provided $\hat{\gamma}(0) \neq 0$ (this is the case if $x_1, \dots, x_n$ are not all equal), the empirical autocorrelation function is given by
$$\hat{\varrho}(h) := \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}, \quad -n < h < n. \tag{2.3}$$

In later chapters we shall discuss the asymptotic behaviour of the empirical mean and the empirical autocovariance function. In many cases of interest they are strongly consistent and asymptotically normal estimators for the corresponding mean and autocovariance function of a strictly and weakly stationary process.
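Definition 2.17 translates directly into code. The sketch below implements (2.1) by hand and compares it with R's acf using type = "covariance", which (to the best of my knowledge) uses the same $1/n$ convention with the mean subtracted; the simulated data are only a stand-in for an observed series.

    # Sketch: empirical ACVF gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar), cf. (2.1).
    emp_acvf <- function(x, h) {
      n <- length(x); xbar <- mean(x)
      sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar)) / n
    }
    set.seed(2)
    x <- rnorm(200)                                    # placeholder for an observed series
    gh <- sapply(0:5, function(h) emp_acvf(x, h))
    ga <- acf(x, type = "covariance", lag.max = 5, plot = FALSE)$acf[, 1, 1]
    round(cbind(lag = 0:5, by_hand = gh, acf_function = ga), 4)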
Sometimes it is necessary to work with complex-valued random variables and time series. If $X = Y + iZ$, where $Y, Z$ are real-valued, then we set $\bar{X} = Y - iZ$, the complex conjugate of $X$. We define the covariance of two complex random variables in such a way that it is linear in the first component. More precisely, we have:

Definition 2.18. a) For two complex-valued random variables $X, Y$ on the same probability space with $E|X|^2 < \infty$, $E|Y|^2 < \infty$, we define
$$\operatorname{Cov}(X, Y) := E(X\bar{Y}) - (EX)(\overline{EY}) = E\bigl[(X - E(X))\overline{(Y - E(Y))}\bigr],$$
which is called the covariance of $X$ and $Y$. When $X = Y$ we call $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$ the variance of $X$. Observe that $\operatorname{Var}(X) \in \mathbb{R}$ and even $\operatorname{Var}(X) \ge 0$, while $\operatorname{Cov}(X, Y) \in \mathbb{C}$.

b) A complex-valued time series $(X_t)_{t \in \mathbb{Z}}$ is said to be weakly stationary if $E X_t = E X_{t'}$ for all $t, t' \in \mathbb{Z}$, $E|X_t|^2 < \infty$ for all $t \in \mathbb{Z}$, and $\gamma_X(r, s) := \operatorname{Cov}(X_r, X_s)$ depends only on the difference $r - s$. Then $\gamma_X(r-s) := \gamma_X(r, s)$ is called the autocovariance function at lag $r - s$. If additionally $\gamma_X(0) \neq 0$, then the autocorrelation function at lag $h$ is defined by
$$\rho_X(h) := \frac{\gamma_X(h)}{\gamma_X(0)}.$$

c) A time series $(X_t)_{t \in \mathbb{Z}}$ is strictly stationary if all its finite dimensional distributions are shift-invariant, i.e. if
$$(X_{t_1}, X_{t_2}, \dots, X_{t_n}) \overset{d}{=} (X_{t_1+k}, X_{t_2+k}, \dots, X_{t_n+k})$$
for all $n \in \mathbb{N}$, all $t_1, \dots, t_n \in \mathbb{Z}$ and all $k \in \mathbb{Z}$.

Remark 2.19. (a) If $(X_t)_{t \in \mathbb{Z}}$ is real-valued and weakly stationary, then $\gamma_X(h) = \gamma_X(-h)$. For a complex-valued weakly stationary time series we have $\gamma_X(h) = \overline{\gamma_X(-h)}$. This follows immediately from the fact that $\operatorname{Cov}(Y, X) = \overline{\operatorname{Cov}(X, Y)}$.

(b) If $x_1, \dots, x_n \in \mathbb{C}$ are realisations of a complex-valued time series, then the empirical mean of $x_1, \dots, x_n$ is defined by $\bar{x} := \frac{1}{n}\sum_{t=1}^{n} x_t$, the empirical autocovariance function by (2.1) for $h = 0, \dots, n-1$ and by $\hat{\gamma}(h) := \overline{\hat{\gamma}(-h)}$ for $h = -n+1, \dots, -1$, and the empirical autocorrelation function again by (2.3).

Convention 2.20. When we speak of stationary time series in this lecture, we shall always mean weakly stationary time series.

Chapter 3
Data cleansing from trends and seasonal effects

In this chapter we present some methods for estimating the trend and the seasonal component, or for eliminating them. Before speaking of these quantities, one should have a model in mind.

3.1 Decomposition of time series

We have seen that we should first identify trend, seasonality and other deterministic quantities. In order to speak about these, we should first have a model behind the data. This is done in the classical (additive) decomposition model:

Definition 3.1. The classical (additive) decomposition model describes a time series $(x_t)$ as a sum
$$x_t = m_t + s_t + y_t,$$
where

- $(m_t)$ denotes the trend component, which is a slowly changing function,
- $(s_t)$ denotes the seasonal component, which is a function with known period $d$,
- $(y_t)$ denotes a random noise component, which is often hoped to be a realisation of a stationary time series.

The random noise component is responsible for the fluctuations of the time series.

Sometimes the trend component is further decomposed into a cyclic component, corresponding e.g. to an economic cycle, and a pure trend component, which might be increasing or decreasing. We shall however not do this, but assume the classical additive decomposition model. There are also many other models possible; one could e.g. also think of a multiplicative decomposition model given as
$$x_t = m_t \cdot s_t \cdot y_t.$$
When all components are positive, taking the logarithm leads to
$$\log x_t = \log m_t + \log s_t + \log y_t,$$
which is again additive. We shall not touch on these other models, but, as said, throughout we shall assume the classical additive model as underlying.
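To get a feeling for the classical decomposition, one can look at R's decompose(), which splits a monthly series into trend, seasonal and random components by moving averages (a method only introduced in Section 3.2.3, so this is merely an illustration of the model, not of the estimators developed below). AirPassengers is the built-in version of the airline data of Example 1.3; taking logarithms first reflects the remark that a multiplicative decomposition becomes additive on the log scale.

    # Sketch: classical additive decomposition x_t = m_t + s_t + y_t of a monthly series.
    # The raw airline data look multiplicative, so we decompose the logged series additively.
    dec <- decompose(log(AirPassengers), type = "additive")
    plot(dec)   # panels: observed series, trend m_t, seasonal s_t, random y_t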
3.2 Estimation of the trend in the absence of seasonality

In this section we present some methods to estimate the trend when there is no seasonal component present. In the next section we shall then be concerned with the estimation of the seasonality when no trend is present, and in the last section with the estimation of both trend and season simultaneously.

Assumption 3.2. The model for the time series $(x_t)$ is given by
$$x_t = m_t + y_t, \quad t = 1, \dots, T,$$
where $y_1, \dots, y_T$ is a "well-behaved" noise term, in particular with expectation 0. The trend is denoted by $(m_t)$ and is a "slowly changing" deterministic function.

We shall consider the following methods for estimating or eliminating the trend of a given time series:

(a) linear regression,
(b) polynomial regression,
(c) moving average smoothing,
(d) exponential smoothing,
(e) differencing,
(f) transformation of the data.

3.2.1 Linear regression

Most of you know linear regression from statistics courses. We assume here that an (affine-)linear trend is present. More precisely, the additional model assumption throughout Subsection 3.2.1 is:

Assumption 3.3. The trend $(m_t)$ is affine linear, i.e. of the form
$$m_t = \alpha t + \beta, \quad t = 1, \dots, T,$$
for some $\alpha, \beta \in \mathbb{R}$.

Definition 3.4. Under the model Assumptions 3.2 and 3.3, linear regression chooses those $\alpha$ and $\beta$ in $\mathbb{R}$ which fit best in a least squares sense, i.e. that satisfy
$$(\alpha, \beta) := \operatorname{argmin}_{(a,b) \in \mathbb{R}^2} \sum_{t=1}^{T} (x_t - at - b)^2.$$

These $\alpha$ and $\beta$ can be given explicitly, as done in the following theorem:

Theorem 3.5. Given data $x_t = m_t + y_t$, $t = 1, \dots, T$, as above (with $T \ge 2$), define the empirical mean
$$\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t$$
and the time average
$$\bar{t}_T := \frac{1}{T} \sum_{t=1}^{T} t = \frac{1}{T} \cdot \frac{T(T+1)}{2} = \frac{T+1}{2}.$$
Then the optimal solution $(\alpha, \beta)$ in the linear regression framework above is given by
$$\alpha := \frac{\sum_{t=1}^{T} (t - \bar{t}_T)\, x_t}{\sum_{t=1}^{T} (t - \bar{t}_T)^2} \quad \text{and} \quad \beta := \bar{x}_T - \alpha \bar{t}_T.$$

You should know this result from an introductory statistics course. I give the proof only for completeness.

Proof. Write
$$g(a, b) := \sum_{t=1}^{T} (x_t - at - b)^2, \quad a, b \in \mathbb{R}.$$
For minimizing over $(a, b)$ we need to take the partial derivatives and set them equal to 0. So
$$0 = \frac{\partial g}{\partial a}(a, b) = 2 \sum_{t=1}^{T} (x_t - at - b) \cdot (-t), \tag{3.1}$$
$$0 = \frac{\partial g}{\partial b}(a, b) = -2 \sum_{t=1}^{T} (x_t - at - b). \tag{3.2}$$
Rearranging (3.2) leads to
$$\bar{x}_T = b + a \frac{1}{T} \sum_{t=1}^{T} t = b + a\bar{t}_T, \tag{3.3}$$
and multiplying this by $\bar{t}_T$ gives
$$\bar{x}_T \cdot \bar{t}_T = b\bar{t}_T + a(\bar{t}_T)^2. \tag{3.4}$$
Rearranging (3.1) we obtain
$$\frac{1}{T} \sum_{t=1}^{T} x_t t = b\bar{t}_T + a \frac{1}{T} \sum_{t=1}^{T} t^2,$$
and subtracting (3.4) from that we obtain
$$a\Bigl(\frac{1}{T} \sum_{t=1}^{T} t^2 - (\bar{t}_T)^2\Bigr) = \frac{1}{T} \sum_{t=1}^{T} x_t t - \bar{x}_T \bar{t}_T = \frac{1}{T} \sum_{t=1}^{T} (x_t t - x_t \bar{t}_T) = \frac{1}{T} \sum_{t=1}^{T} (t - \bar{t}_T)\, x_t. \tag{3.5}$$
Observe that
$$\frac{1}{T} \sum_{t=1}^{T} t^2 - (\bar{t}_T)^2 = \frac{1}{T} \sum_{t=1}^{T} (t - \bar{t}_T + \bar{t}_T)^2 - (\bar{t}_T)^2 = \frac{1}{T} \sum_{t=1}^{T} (t - \bar{t}_T)^2 + \frac{2\bar{t}_T}{T} \underbrace{\sum_{t=1}^{T} (t - \bar{t}_T)}_{=0} + (\bar{t}_T)^2 - (\bar{t}_T)^2 = \frac{1}{T} \sum_{t=1}^{T} (t - \bar{t}_T)^2.$$
Plugging this into (3.5) we get
$$\alpha = \alpha_{\mathrm{opt}} = \frac{\sum_{t=1}^{T} (t - \bar{t}_T)\, x_t}{\sum_{t=1}^{T} (t - \bar{t}_T)^2},$$
and plugging this again into (3.3) we get $\beta = \beta_{\mathrm{opt}} = \bar{x}_T - \alpha \bar{t}_T$. Formally, we should still prove that this is indeed the global minimum (not only a local one, and not a maximum), but we leave that out. It is clear from the setting that a global minimum must exist, and there the partial derivatives must be 0. As we found only one point with vanishing partial derivatives, this must be the global minimum.

Example 3.6. Recall the Dow Jones Utilities Index data from Example 1.5 and Figure 1.4. Applying the linear regression as described above gives for the trend $m_t = 100 + 0.23t$. Figure 3.1 shows the data together with the fitted trend. Figure 3.2 shows the residuals, i.e. $y_t = x_t - m_t$. A simple model for the Dow Jones Utilities Index could hence be
$$x_t = 100 + 0.23t + y_t,$$
and one could try to forecast $x_{90} = 100 + 0.23 \cdot 90 = 120.7$. However, looking at the fit, it does not really seem to be a good one; there might still be room to find more structure in $(y_t)$, or to use a more general regression.
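The explicit formulas of Theorem 3.5 are exactly what R's lm() computes for the model $x_t = \beta + \alpha t + y_t$. A sketch on artificial data (the Dow Jones data of Example 3.6 are distributed with ITSM, not with base R), comparing the closed-form estimates with the lm() coefficients:

    # Sketch: least squares trend m_t = alpha*t + beta (Theorem 3.5), explicit formulas vs. lm().
    set.seed(3)
    n <- 78                                   # sample size, "T" in the notation of the notes
    t <- 1:n
    x <- 100 + 0.23 * t + rnorm(n, sd = 2)    # artificial series with an affine-linear trend
    alpha <- sum((t - mean(t)) * x) / sum((t - mean(t))^2)
    beta  <- mean(x) - alpha * mean(t)
    c(alpha = alpha, beta = beta)
    coef(lm(x ~ t))                           # "(Intercept)" corresponds to beta, "t" to alpha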
3.2.2 Polynomial or more general regression

Again we impose Assumption 3.2, but Assumption 3.3 is replaced by the following:

Assumption 3.7. (a) The trend $(m_t)$ obeys a polynomial regression, i.e. there are $\beta_0, \beta_1, \dots, \beta_{p-1} \in \mathbb{R}$ such that
$$m_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \dots + \beta_{p-1} t^{p-1}, \quad t = 1, \dots, T.$$
Or, more generally:

(b) The trend $(m_t)$ obeys a regression of the form
$$m_t = \beta_1 f_1(t) + \beta_2 f_2(t) + \dots + \beta_p f_p(t), \quad t = 1, \dots, T,$$
where $f_1 : \{1, 2, \dots, T\} \to \mathbb{R}, \dots, f_p : \{1, 2, \dots, T\} \to \mathbb{R}$ are given functions.

Definition 3.8. Under the model Assumptions 3.2 and 3.7 (a), polynomial regression chooses the $\beta_0, \dots, \beta_{p-1}$ which fit best in a least squares sense, i.e.
$$(\beta_0, \dots, \beta_{p-1}) := \operatorname{argmin}_{(b_0, \dots, b_{p-1}) \in \mathbb{R}^p} \sum_{t=1}^{T} (x_t - b_0 - b_1 t - \dots - b_{p-1} t^{p-1})^2.$$
Under the model Assumptions 3.2 and 3.7 (b), general regression chooses the $\beta_1, \dots, \beta_p$ which fit best in a least squares sense, i.e.
$$(\beta_1, \dots, \beta_p) := \operatorname{argmin}_{(b_1, \dots, b_p) \in \mathbb{R}^p} \sum_{t=1}^{T} (x_t - b_1 f_1(t) - \dots - b_p f_p(t))^2.$$

Polynomial regression is obviously a special case of general regression, obtained by choosing $f_j(t) = t^{j-1}$. Linear regression is the special case of polynomial regression with $p = 2$. The solution of the general regression problem is given in the next theorem. Recall that the rank of a matrix is the maximal number of linearly independent columns, equivalently the maximal number of linearly independent rows.

Theorem 3.9. In the framework of general regression, suppose that $T \ge p$ and that $\operatorname{rank}(A) = p$, where
$$A := \begin{pmatrix} f_1(1) & f_2(1) & \dots & f_p(1) \\ \vdots & \vdots & & \vdots \\ f_1(T) & f_2(T) & \dots & f_p(T) \end{pmatrix} \in \mathbb{R}^{T \times p}$$
is the design matrix. Then $A'A \in \mathbb{R}^{p \times p}$ is invertible (here $A'$ denotes the transpose of $A$) and the optimal solution to the regression problem is given by
$$\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} = (A'A)^{-1} A' \begin{pmatrix} x_1 \\ \vdots \\ x_T \end{pmatrix}.$$

Again, the result should be known from a standard statistics course, but since I have it ready anyway, I include it in these notes.

Proof. Write
$$g(b_1, \dots, b_p) = \sum_{t=1}^{T} (x_t - b_1 f_1(t) - \dots - b_p f_p(t))^2.$$
When the minimum is attained, the partial derivatives must be 0, so
$$0 = \frac{\partial g}{\partial b_j}(b_1, \dots, b_p) = 2 \sum_{t=1}^{T} \bigl(x_t - b_1 f_1(t) - \dots - b_p f_p(t)\bigr)\bigl(-f_j(t)\bigr) \quad \forall\, j = 1, \dots, p.$$
This is equivalent to
$$(f_j(1), \dots, f_j(T)) \cdot \begin{pmatrix} f_1(1) & \dots & f_p(1) \\ \vdots & & \vdots \\ f_1(T) & \dots & f_p(T) \end{pmatrix} \cdot \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix} = (f_j(1), \dots, f_j(T)) \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_T \end{pmatrix}$$
for all $j \in \{1, \dots, p\}$; to see this, simply multiply the matrix equation out and multiply it by 2, then one gets the above equation. The $p$ equations above can again be rewritten as a single matrix equation of the form
$$\begin{pmatrix} f_1(1) & \dots & f_1(T) \\ \vdots & & \vdots \\ f_p(1) & \dots & f_p(T) \end{pmatrix} \begin{pmatrix} f_1(1) & \dots & f_p(1) \\ \vdots & & \vdots \\ f_1(T) & \dots & f_p(T) \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix} = \begin{pmatrix} f_1(1) & \dots & f_1(T) \\ \vdots & & \vdots \\ f_p(1) & \dots & f_p(T) \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_T \end{pmatrix}.$$
With the definition of the design matrix this can be rewritten as
$$A'A \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix} = A' \begin{pmatrix} x_1 \\ \vdots \\ x_T \end{pmatrix}.$$
Hence, if $A'A$ is invertible, we get the desired form for the least squares estimator in the general regression. That we have indeed a global and unique minimum is checked by an elementary curve discussion. So we only have to verify that $A'A$ is invertible. Suppose not. Then there is $c \in \mathbb{R}^p \setminus \{0\}$ such that $A'Ac = 0$. Multiplying from the left by $c'$ we obtain (with $\langle \cdot, \cdot \rangle$ the standard Euclidean product and $|\cdot|$ the norm in $\mathbb{R}^T$)
$$0 = c'A'Ac = \langle Ac, Ac \rangle = |Ac|^2.$$
Hence we conclude that $Ac = 0$. But since $\operatorname{rank}(A) = p \le T$, it is known from linear algebra that $A$ is injective. This is a contradiction to the existence of $c \in \mathbb{R}^p \setminus \{0\}$ with $Ac = 0$. Hence $A'A$ must be invertible.
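The least squares solution $(A'A)^{-1}A'x$ of Theorem 3.9 can be written down directly in R. The following sketch does this for a quadratic trend ($p = 3$, i.e. $f_j(t) = t^{j-1}$) on artificial data and compares the result with lm(); the coefficients of the simulated trend are arbitrary illustrative values.

    # Sketch: general/polynomial regression via the design matrix (Theorem 3.9),
    # beta_hat = (A'A)^{-1} A'x, for a quadratic trend on artificial data.
    set.seed(4)
    n <- 120; t <- 1:n
    x <- 5 + 0.1 * t - 0.002 * t^2 + rnorm(n)
    A <- cbind(1, t, t^2)                              # design matrix with columns f_j(t) = t^(j-1)
    beta_hat <- solve(crossprod(A), crossprod(A, x))   # solves (A'A) beta = A'x
    drop(beta_hat)
    coef(lm(x ~ t + I(t^2)))                           # same estimates, different labels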
Applying this to polynomial regression we obtain:

Corollary 3.10. In the framework of polynomial regression, suppose that $T \ge p$, and let
$$A := \begin{pmatrix} 1 & 1 & \dots & 1 \\ 1 & 2 & \dots & 2^{p-1} \\ \vdots & \vdots & & \vdots \\ 1 & T & \dots & T^{p-1} \end{pmatrix}.$$
Then $A'A$ is invertible and the optimal solution to the polynomial regression problem is given by
$$\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_{p-1} \end{pmatrix} = (A'A)^{-1} A' \begin{pmatrix} x_1 \\ \vdots \\ x_T \end{pmatrix}.$$

Proof. The submatrix
$$B := \begin{pmatrix} 1 & 1 & \dots & 1 \\ 1 & 2 & \dots & 2^{p-1} \\ \vdots & \vdots & & \vdots \\ 1 & p & \dots & p^{p-1} \end{pmatrix} \in \mathbb{R}^{p \times p}$$
is a Vandermonde matrix and hence has determinant
$$\det B = \prod_{1 \le i < j \le p} (j - i) \neq 0;$$
[...]

...for $|\varphi_1| > 1$ it is given by
$$X_t = -\sum_{j=1}^{\infty} \varphi_1^{-j} Z_{t+j}, \quad t \in \mathbb{Z}$$
(the sum converging almost surely absolutely and in $L^2$). When $|\varphi_1| < 1$, the ACVF of $X$ is given by
$$\gamma_X(h) = \begin{cases} \dfrac{\sigma^2}{1 - |\varphi_1|^2}\, \varphi_1^{h}, & h \in \mathbb{N}_0, \\ \dfrac{\sigma^2}{1 - |\varphi_1|^2}\, \varphi_1^{|h|}, & -h \in \mathbb{N}, \end{cases}$$
while for $|\varphi_1| > 1$ it is given by
$$\gamma_X(h) = \begin{cases} \dfrac{\sigma^2}{|\varphi_1|^2 - 1}\, \varphi_1^{-h}, & h \in \mathbb{N}_0, \\ \dfrac{\sigma^2}{|\varphi_1|^2 - 1}\, \varphi_1^{h}, & -h \in \mathbb{N}. \end{cases}$$

Proof. (a) Consider first the case $|\varphi_1| < 1$. To show the existence and form of the weakly stationary solution, define $X_t$ by $X_t := \sum_{k=0}^{\infty} \varphi_1^k Z_{t-k}$, $t \in \mathbb{Z}$. Since $\sum_{j=0}^{\infty} |\varphi_1|^j = \frac{1}{1 - |\varphi_1|} < \infty$ as a consequence of $|\varphi_1| < 1$, the sequence $(\varphi_1^j)_{j \in \mathbb{N}_0}$ is indeed a linear filter (define the filter coefficients to be 0 for negative indices). Then $X$ is weakly stationary by Corollary 5.4 with ACVF given for $h \in \mathbb{N}_0$ by
$$\gamma_X(h) = \sigma^2 \sum_{j=h}^{\infty} \varphi_1^{j} \varphi_1^{j-h} = \sigma^2 \varphi_1^{h} \sum_{j=0}^{\infty} |\varphi_1|^{2j} = \frac{\sigma^2}{1 - |\varphi_1|^2}\, \varphi_1^{h},$$
and for $h < 0$ we use $\gamma_X(h) = \gamma_X(-h)$. To see that $X$ thus defined indeed satisfies the AR(1) equation, observe that
$$X_t - \varphi_1 X_{t-1} = \sum_{k=0}^{\infty} \varphi_1^k Z_{t-k} - \varphi_1 \sum_{k=0}^{\infty} \varphi_1^k Z_{t-1-k} = Z_t + \sum_{k=1}^{\infty} (\varphi_1^k - \varphi_1^k) Z_{t-k} = Z_t,$$
so that $X$ is indeed a solution of (6.3). To see the uniqueness of the solution, suppose that $X = (X_t)_{t \in \mathbb{Z}}$ is some weakly stationary solution of (6.3), so $X_t = Z_t + \varphi_1 X_{t-1}$ for all $t \in \mathbb{Z}$. Iterating this again and again, we obtain
$$X_t = Z_t + \varphi_1 X_{t-1} = Z_t + \varphi_1 (Z_{t-1} + \varphi_1 X_{t-2}) = Z_t + \varphi_1 Z_{t-1} + \varphi_1^2 X_{t-2} = Z_t + \varphi_1 Z_{t-1} + \varphi_1^2 (Z_{t-2} + \varphi_1 X_{t-3}) = \dots = \sum_{k=0}^{N} \varphi_1^k Z_{t-k} + \varphi_1^{N+1} X_{t-N-1} \tag{6.4}$$
for any $N \in \mathbb{N}$. Letting $N \to \infty$ we see that $\sum_{k=0}^{N} \varphi_1^k Z_{t-k}$ converges (in mean square and almost surely) to $\sum_{k=0}^{\infty} \varphi_1^k Z_{t-k}$, and that $\varphi_1^{N+1} X_{t-N-1}$ converges in mean square to 0, since
$$E\bigl|\varphi_1^{N+1} X_{t-N-1}\bigr|^2 = |\varphi_1|^{2(N+1)} \cdot E|X_{t-N-1}|^2,$$
where $|\varphi_1|^{2(N+1)} \to 0$ since $|\varphi_1| < 1$, and $E|X_{t-N-1}|^2$ is bounded since $X$ is weakly stationary. It follows that $X_t = \sum_{k=0}^{\infty} \varphi_1^k Z_{t-k}$; in particular, there is only one solution.

(b) When $|\varphi_1| > 1$, we can rewrite (6.3) as
$$X_{t-1} = -\frac{1}{\varphi_1} Z_t + \frac{1}{\varphi_1} X_t.$$
Iterating this we obtain
$$X_t = -\frac{1}{\varphi_1} Z_{t+1} + \frac{1}{\varphi_1} X_{t+1} = -\frac{1}{\varphi_1} Z_{t+1} - \frac{1}{\varphi_1^2} Z_{t+2} + \frac{1}{\varphi_1^2} X_{t+2} = \dots = -\sum_{k=1}^{N} \varphi_1^{-k} Z_{t+k} + \frac{1}{\varphi_1^N} X_{t+N}.$$
Letting $N \to \infty$, this converges to $-\sum_{k=1}^{\infty} \varphi_1^{-k} Z_{t+k}$ (observe that $|\varphi_1^{-1}| < 1$), giving uniqueness and a candidate for the solution. Now define $X_t := -\sum_{k=1}^{\infty} \varphi_1^{-k} Z_{t+k}$. This is weakly stationary by Corollary 5.4, and similarly as in (a) it is checked that it indeed defines a solution of (6.3). Again by Corollary 5.4, its ACVF is given for $h \in \mathbb{N}_0$ by
$$\gamma_X(h) = \sigma^2 \sum_{j=1}^{\infty} \varphi_1^{-j} \varphi_1^{-(j+h)} = \sigma^2 \varphi_1^{-h} \sum_{j=1}^{\infty} |\varphi_1|^{-2j} = \frac{\sigma^2}{|\varphi_1|^2 - 1}\, \varphi_1^{-h}.$$
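For $|\varphi_1| < 1$, the ACVF just derived, $\gamma_X(h) = \sigma^2 \varphi_1^h / (1 - \varphi_1^2)$, can be compared with the empirical ACVF of a simulated path. The sketch below uses the arbitrary illustrative value $\varphi_1 = 0.7$ and Gaussian noise.

    # Sketch: theoretical AR(1) ACVF sigma^2 * phi1^h / (1 - phi1^2) vs. the empirical ACVF
    # of a long simulated path (|phi1| < 1).
    set.seed(1)
    phi1 <- 0.7; sigma <- 1
    x <- arima.sim(model = list(ar = phi1), n = 5000, sd = sigma)
    emp  <- acf(x, type = "covariance", lag.max = 10, plot = FALSE)$acf[, 1, 1]
    theo <- sigma^2 * phi1^(0:10) / (1 - phi1^2)
    round(cbind(lag = 0:10, empirical = emp, theoretical = theo), 3)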
[...]

...for $|\varphi_1| > 1$ it is given by $X_t = -\dfrac{\theta_1}{\varphi_1} Z_t - \sum_{j=1}^{\infty} (\varphi_1 + \theta_1)\, \varphi_1^{-j-1} Z_{t+j}$.

(b) Suppose that $|\varphi_1| = 1$. Then the ARMA(1,1) equation (6.5) admits a stationary solution if and only if $\varphi_1 = -\theta_1$, in which case one such solution is given by $X_t = Z_t$. If the underlying probability space $(\Omega, \mathcal{F}, P)$ on which $(Z_t)_{t \in \mathbb{Z}}$ is defined is rich enough to support a Bernoulli distributed random variable $U$ with parameter $1/2$ that is independent of $(Z_t)_{t \in \mathbb{Z}}$, then also $X_t := Z_t + \varphi_1^t (2U - 1)$ is a stationary solution of (6.5); in particular, the stationary solution is not unique.

Proof. Much of the proof can be done in a similar way as the proof of Theorem 6.5, but we give another one using the characteristic polynomials, which paves the way for the more general case later. The characteristic polynomials are given by $\varphi(z) = 1 - \varphi_1 z$ and $\theta(z) = 1 + \theta_1 z$, so that the ARMA(1,1) equation reads $\varphi(B) X_t = \theta(B) Z_t$.

(a) (i) Let $|\varphi_1| < 1$. To show uniqueness, let $X = (X_t)_{t \in \mathbb{Z}}$ be a stationary solution. We would like to solve this equation by applying the filter function $S^1 \ni z \mapsto \frac{1}{\varphi(z)}$ to it. But is that actually a filter function, i.e. does there exist a linear filter $(\psi_j)_{j \in \mathbb{Z}}$ such that $\frac{1}{\varphi(z)} = \sum_{j=0}^{\infty} \psi_j z^j$ for all $z \in S^1$? The answer here is yes, and by using the geometric series we obtain that
$$\frac{1}{\varphi(z)} = \frac{1}{1 - \varphi_1 z} = \sum_{j=0}^{\infty} \varphi_1^j z^j,$$
which converges absolutely for $z \in \mathbb{C}$ with $|z| < 1/|\varphi_1|$, in particular for $z \in S^1$. So if $X$ is a stationary solution of $\varphi(B) X_t = \theta(B) Z_t$, we can apply the linear filter described by the filter function $S^1 \ni z \mapsto \frac{1}{\varphi(z)}$ on both sides of the equation and obtain from Theorem 5.14
$$X_t = B^0 X_t = \Bigl(\frac{1}{\varphi}\,\varphi\Bigr)(B) X_t = \frac{1}{\varphi}(B)\bigl(\varphi(B) X_t\bigr) = \frac{1}{\varphi}(B)\,\theta(B) Z_t = \frac{\theta}{\varphi}(B) Z_t.$$
The right hand side of this equation is uniquely determined by the characteristic polynomials and $Z$, so that the solution is unique if it exists. To see the existence, we define $X_t := \frac{\theta}{\varphi}(B) Z_t$. We have already seen that $S^1 \ni z \mapsto \frac{1}{\varphi(z)}$ is a filter function, and since $S^1 \ni z \mapsto \theta(z)$ is one and since the product of two filter functions is again a filter function by Theorem 5.14, so is $S^1 \ni z \mapsto \frac{\theta(z)}{\varphi(z)}$. This shows that $X = (X_t)_{t \in \mathbb{Z}}$ is weakly stationary, and from Theorem 5.14 we obtain that
$$\varphi(B) X_t = \varphi(B)\,\frac{\theta}{\varphi}(B) Z_t = \frac{\varphi\theta}{\varphi}(B) Z_t = \theta(B) Z_t,$$
so that $X$ solves the ARMA(1,1) equation. Finally, observe that for $z \in S^1$
$$\frac{\theta(z)}{\varphi(z)} = (1 + \theta_1 z) \sum_{j=0}^{\infty} \varphi_1^j z^j = z^0 + \sum_{j=1}^{\infty} (\varphi_1 + \theta_1)\, \varphi_1^{j-1} z^j.$$
This shows that $X_t = \frac{\theta}{\varphi}(B) Z_t = Z_t + \sum_{j=1}^{\infty} (\varphi_1 + \theta_1)\, \varphi_1^{j-1} Z_{t-j}$.

(ii) Now let $|\varphi_1| > 1$. We would like to use the same trick again, but we must assure ourselves that $S^1 \ni z \mapsto \frac{1}{\varphi(z)}$ [...]
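The MA(∞) coefficients just computed in part (a)(i), $\psi_j = (\varphi_1 + \theta_1)\varphi_1^{j-1}$ for $j \ge 1$, can be checked numerically: R's ARMAtoMA returns the coefficients $\psi_1, \psi_2, \dots$ of the causal representation of an ARMA process. The values $\varphi_1 = 0.5$, $\theta_1 = 0.3$ below are arbitrary and chosen only for illustration.

    # Sketch: check psi_j = (phi1 + theta1) * phi1^(j-1), j >= 1, for a causal ARMA(1,1)
    # against the coefficients returned by stats::ARMAtoMA.
    phi1 <- 0.5; theta1 <- 0.3                         # illustrative values with |phi1| < 1
    psi_formula <- (phi1 + theta1) * phi1^(0:9)        # psi_1, ..., psi_10 from the computation above
    psi_R <- ARMAtoMA(ar = phi1, ma = theta1, lag.max = 10)
    all.equal(psi_formula, psi_R)                       # should be TRUE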