Time Series Data Overview
Norwegian University of Life Sciences
Summary
This document provides an overview of time series concepts, including univariate and multivariate time series, trend and seasonality, data representation and preprocessing, stochastic processes, Markov chains, hidden Markov models, and time series classification and clustering. It is a lecture/presentation on time series analysis techniques.
Full Transcript
Time series data

Time series versus random samples

- A random sample $\{x_1, \dots, x_n\}$ fulfills the i.i.d. property, i.e.
  - statistical independence
  - identical distribution
- Time series data are, by default, neither independent nor identically distributed!
- Example: violation of the i.i.d. property
  - temperature measurements on two consecutive days $x_t, x_{t+1}$ are correlated across the year (not independent)
  - daily average temperatures change between seasons (distinct distributions)

Univariate and multivariate time series

- Notation for univariate time series: by default, we consider a univariate time series as a vector $(x_t)_{t \in T}$, where
  - $T = \{t_{\min}, t_{\min}+1, \dots, t_{\max}\}$ is denoted as the index set, and
  - $x_t \in \mathbb{R}$ for all $t \in T$
- Notation for multivariate time series: for a multivariate time series $(x_t)_{t \in T}$, it holds instead that $x_t = \left(x_t^{(1)}, \dots, x_t^{(n)}\right) \in \mathbb{R}^n$ for some $n > 1$.

Trend and Seasonality

- Often, time series are described as a sum (or product) of 3 components:
  - a smooth, non-periodic function over time, indicating the systematic variation over time (trend)
  - a periodic function, indicating the recurring behavior over time (seasonality)
  - a (time-independent) random noise term
- Figure 7: Trend, seasonality and noise are characteristics of time series. (Panels: (a) full time series, (b) trend, (c) seasonality, (d) noise.)

Data Representation

Representation of time series

- For time series data, the following transforms are common for preprocessing:
  - power transform (Box-Cox)
  - difference transform
  - standardization / normalization
  - smoothing
  - principal component transformation (principal component analysis)
- Details follow in the section Transformations and in Mills (2019, ch. 2).

Dimensionality and time axis

The time axis

- numerical or (ordered) categorical time axis?
- if numerical, is the time axis continuous or discrete?
- determine the time domain $T = [t_{\min}, t_{\max}]$
- regular vs. irregular time series; if regular, determine $\Delta t$
- resolution of the time axis: assess the resolution w.r.t. the modeling goal
  - too fine: long time lags (high model complexity), micro-structure
  - too coarse: insufficient level of detail

Distribution metrics

Summary statistics

Statistics summarizing the (univariate) empirical distribution(s), assuming $\Delta t = 1$ (see the sketch below):

- arithmetic mean: $\bar{x} = \frac{1}{t_{\max} - t_{\min} + 1} \sum_{t=t_{\min}}^{t_{\max}} x_t$
- variance: $\sigma^2 = \frac{1}{t_{\max} - t_{\min}} \sum_{t=t_{\min}}^{t_{\max}} (x_t - \bar{x})^2$
- median: $q_{0.5}(\{x_t : t \in T\})$
- inter-quartile range: $q_{0.75}(\{x_t : t \in T\}) - q_{0.25}(\{x_t : t \in T\})$
- minimum: $\min_{t \in T} x_t$; maximum: $\max_{t \in T} x_t$
- empirical moments: skewness, kurtosis, etc.
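A minimal base R sketch (not from the slides) computing these statistics on a simulated, hypothetical series. Note that var() uses the unbiased denominator $t_{\max} - t_{\min}$ from the table above, and that base R has no built-in skewness, so it is computed from the empirical moments.

```r
# Minimal sketch: summary statistics for a simulated univariate series
set.seed(1)
x <- 20 + 5 * sin(2 * pi * (1:120) / 12) + rnorm(120)  # hypothetical monthly-like data

mean(x)                            # arithmetic mean
var(x)                             # variance, denominator (t_max - t_min)
median(x)                          # median q_0.5
IQR(x)                             # inter-quartile range q_0.75 - q_0.25
range(x)                           # minimum and maximum
mean((x - mean(x))^3) / sd(x)^3    # empirical skewness (3rd standardized moment)
```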
Backshift operator (lag)

- time series tools require moving forward/backward in time
- backshift operator (lag operator): $B(x_t) = x_{t-1}$ for $t > 1$
- (integer) powers of $B$:
  - $B^k(x_t) = x_{t-k}$ for $k \in \mathbb{N}$ and $t > k$
  - $B^{-1}(x_t) = x_{t+1}$ for $t < t_{\max}$
  - $B^{-k}(x_t) = x_{t+k}$ for $k \in \mathbb{N}$ and $t \le t_{\max} - k$
- (Diagram: $B^2$, $B$ and $B^{-1}$ mapping $x_t$ to $x_{t-2}$, $x_{t-1}$ and $x_{t+1}$ along the time axis.)

Autocorrelation

- autocorrelation = correlation between a time series and its own lagged version, $\mathrm{acf}_x : \mathbb{N} \to [-1, 1]$,
  $\mathrm{acf}_x(k) = \mathrm{cor}(x_t, x_{t-k}) = \mathrm{cor}(x_t, B^k(x_t))$ for $t > k$
- linear dependence between elements of $(x_t)_{t \in T}$ at distinct $t \in T$

Cross-correlation

- cross-correlation = correlation between a time series and the lagged version of another time series, $\mathrm{ccf}_{x,y} : \mathbb{Z} \to [-1, 1]$,
  $\mathrm{ccf}_{x,y}(k) = \mathrm{cor}(x_t, y_{t-k}) = \mathrm{cor}(x_t, B^k(y_t))$ for $t > k$
- linear dependence between $x_t$ and $y_t$ at distinct $t \in T$ (see the sketch below)
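A minimal sketch of acf() and ccf() in base R; the AR(1) simulation and the 3-step delay are illustrative assumptions, not course data. Note R's convention: ccf(x, y) estimates the correlation between x at time t+k and y at time t, so a series y that trails x by 3 steps peaks at lag -3.

```r
# Minimal sketch: autocorrelation and cross-correlation
set.seed(1)
x <- arima.sim(list(ar = 0.7), n = 200)          # AR(1): neighboring values correlate
acf(x, lag.max = 20)                             # estimates acf_x(k)

# y follows x with a delay of 3 steps: y_t = x_{t-3} + noise (i.e., B^3 applied to x)
y <- stats::filter(x, c(0, 0, 0, 1), sides = 1) + rnorm(200, sd = 0.2)
ccf(x, y, lag.max = 10, na.action = na.omit)     # peak at lag -3: x leads y
```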
Missing values in time series

Missing values: different ways of missing

- Missing values cannot be handled by many statistical and machine learning models
- First studied systematically by Rubin
- 3 types of missing values:
  - MCAR (missing completely at random): completely unrelated to the process of interest, e.g., a data transmission error between thermometer and database
  - MAR (missing at random): the probability of a miss is governed by another process, e.g., the thermometer fails more often in high humidity; conditioned on humidity, missingness is entirely random
  - MNAR (missing not at random): the probability of a miss is related to the process studied, e.g., a miss is more likely after a prolonged cold spell
- See also Dong and Peng, Mack et al., and Wikipedia

Handling missing values

- Option 1: remove missing values
  - usually not possible for time series, since a regular time axis is required
- Option 2: replace missing values
  - replace by a fixed value (constant) or the global distribution mean / median / etc.
  - interpolate from non-missing neighbors (carry-forward, carry-backward)
  - replace by a rolling mean / median / weighted mean
  - linear interpolation
  - spline interpolation
  - interpolation using forecasting models (see the section Forecasting)
- See Moritz and Bartz-Beielstein for more information
- See interpolation.Rmd for code for the following examples

Preprocessing time series data

- Goal: represent data in a "standardized" format and allow / facilitate further processing
- Aspects to consider:
  - transform to the same scale: standardization
  - remove skewness in the data distribution: power transform
  - remove trends: differencing
  - reduce measurement noise: smoothing
  - handle missing values: imputation

Global missing value replacement

- Default values (often 0 or 1) may be used for replacement
- A global mean / median might be computed from the data
- Random data can also be used
- Problem: the time dynamic is not taken into account, e.g., trend or seasonality
- (Figure panels: (a) global mean, (b) global median, (c) random replacement; each shows the imputed series and the density of imputed vs. original values.)

Local missing value replacement: rolling average

- (Figure panels: (a) plain rolling average, (b) linearly weighted average, (c) exponentially weighted average; width parameter 2 in each case, with decay 0.5 for the exponential variant.)

Linear vs. spline interpolation

- (Figure panels: (a) linear interpolation, (b) spline interpolation.)
- Check the scales! Take a careful look at the value axes in the left and right plot: the spline panel extends far beyond the range of the original data, i.e., spline interpolation can overshoot. (See the sketch below.)
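A minimal base R sketch of the replacement options above on a hypothetical series (the course code lives in interpolation.Rmd); approx() covers carry-forward and linear interpolation, spline() the spline variant.

```r
# Minimal sketch: global vs. interpolation-based imputation
set.seed(1)
x <- 50 + cumsum(rnorm(150))               # hypothetical smooth-ish series
x[c(30:34, 90)] <- NA                      # introduce missing values
t <- seq_along(x)
obs <- !is.na(x)

x_mean <- ifelse(obs, x, mean(x, na.rm = TRUE))                    # global mean
x_locf <- approx(t[obs], x[obs], xout = t, method = "constant")$y  # carry-forward
x_lin  <- approx(t[obs], x[obs], xout = t)$y                       # linear interpolation
x_spl  <- spline(t[obs], x[obs], xout = t)$y                       # spline (may overshoot)
```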
Standardization & Normalization

- Normalization = a linear transformation to improve "comparability" of data scales
- standardization: $x^*_t = \frac{x_t - \bar{x}}{\sigma}$ (mean 0, standard deviation 1)
- robust standardization: $x^*_t = \frac{x_t - \mathrm{median}(x_t)}{\mathrm{iqr}(x_t)}$
- min-max normalization: $x^*_t = \frac{x_t - \min(x_t)}{\max(x_t) - \min(x_t)}$ (all values in $[0, 1]$)

Power Transform

- In decompositions, noise components $\varepsilon_t$ are often assumed to be Gaussian, i.e., $\varepsilon_t \sim N(0, \sigma^2)$
- This assumption requires that the noise density is symmetric
- What if the density is asymmetric?
- A common choice for right-skewed data to obtain a Gaussian-like distribution: $x^*_t = \log x_t$
- (Figure: a right-skewed time series and its density.)

Power Transform: Log transformation

- The log-transform works well for data which are
  - strictly positive ($x_t > 0$ for all $t$), and
  - (approximately) log-normally distributed: $X \sim \mathrm{lognorm}(\mu, \sigma^2) \iff \log X \sim N(\mu, \sigma^2)$
- (Figure: the skewed time series after the log transform, with an approximately symmetric density.)

Advanced Power Transforms

- Stronger/weaker deviations from a centered distribution, or left-skewed distributions: Box-Cox transform
- Values that are not strictly positive: Yeo-Johnson transform
- See also Hyndman and Athanasopoulos (2021, ch. 3.1) and Mills (2019, ch. 2)
- Box-Cox is a generalization of the log-transform with parameter $\lambda$:
  $x^*_t = \frac{x_t^\lambda - 1}{\lambda}$ if $\lambda \ne 0$, and $x^*_t = \log x_t$ otherwise
- How to determine $\lambda$?
  - idea: measure the deviation from a "perfect" normal distribution
  - profile likelihoods
  - goodness-of-fit tests

Differencing

- How about a discrete time axis (index set $T$)?
- On a continuous axis, we have $\frac{\partial}{\partial t} x_t = \lim_{\Delta t \to 0} \frac{x_t - x_{t-\Delta t}}{\Delta t}$
- A (finite) approximation to the derivative is given by the first-order differences:
  $\frac{\partial}{\partial t} x_t \approx \frac{x_t - x_{t-\Delta t}}{\Delta t}$ for small $\Delta t$
- On a discrete axis, the difference operator is therefore given as $D(x_t) = x_t - B(x_t) = x_t - x_{t-1}$ for $t > 1$
- See also Mills (2019, ch. 2) and Hyndman and Athanasopoulos (2021, ch. 9.1)

Smoothing: Rolling Average

- Trends and seasonalities are often hidden under measurement noise
- Smoothing helps to stabilize the data distribution locally (reduce the variance of the error terms)
- Rolling average (moving average): calculate the average over $k$ local neighbors ($\rho = \frac{k-1}{2}$):
  $x^*_t = \mu(x_{t-\rho}, \dots, x_{t+\rho})$ for $t \in \{\rho + 1, \dots, t_{\max} - \rho\}$
- See Hyndman and Athanasopoulos (2021, ch. 3.3), and the sketch below
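A minimal sketch combining the preprocessing steps above; the simulated series and the parameter choices (lambda, window width) are assumptions for illustration, not course defaults.

```r
# Minimal sketch: standardization, power transform, differencing, smoothing
set.seed(1)
x <- exp(rnorm(200, mean = 2, sd = 0.4)) + (1:200) / 20   # right-skewed, with trend

z  <- (x - mean(x)) / sd(x)                  # standardization (mean 0, sd 1)
mm <- (x - min(x)) / (max(x) - min(x))       # min-max normalization to [0, 1]

box_cox <- function(x, lambda)               # Box-Cox; lambda = 0 gives the log
  if (lambda != 0) (x^lambda - 1) / lambda else log(x)
y <- box_cox(x, lambda = 0)

d <- diff(y)                                 # first-order differences D(x_t)

k  <- 5                                      # window k = 2 * rho + 1
sm <- stats::filter(x, rep(1 / k, k), sides = 2)  # centered rolling average
```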
Principal Component Analysis

Dimensionality Reduction

- Multivariate data are difficult to handle
  - curse of dimensionality
  - redundant information
- Approach: dimensionality reduction
- A straightforward technique: Principal Component Analysis (PCA)
- Goals:
  - identify highly correlated aspects in time series, e.g., temperatures in neighboring cities
  - identify noise components
  - focus on components contributing information
- (Figure 1: a multivariate time series with three sensors S1, S2, S3 over time.)

Dimensionality Reduction using PCA

- Sort the PCs in decreasing order of variance
- Keep the first $d$ PCs
- Let $L_{(d)} \in \mathbb{R}^{N \times d}$ be the matrix with the first $d$ columns of $L$, i.e., the first $d$ PCs
- Let $S_{(d)} \in \mathbb{R}^{|T| \times d}$ be the corresponding scores
- Alternative A: perform further analysis steps on $S_{(d)}$
- Alternative B: transform back to $X' = S_{(d)} L_{(d)}^T$
- How to choose $d$? (See the sketch below.)
  - look for a regime change in the scree plot (Figure 5: scree plot; source: Wikipedia/Jesslynn34 under CC-BY-SA)
  - set a criterion $\alpha \in [0, 1]$ for the variance explained, i.e., choose the smallest $d$ with $\frac{\sum_{k=1}^{d} \lambda_k}{\sum_{k=1}^{N} \lambda_k} \ge \alpha$
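A minimal prcomp() sketch on simulated data in which two sensors share a signal and a third is pure noise; the threshold alpha = 0.9 is an assumed choice. Note that prcomp() centers/scales internally, so the Alt B back-transform below lives in the centered and scaled space.

```r
# Minimal sketch: PCA-based dimensionality reduction of a multivariate series
set.seed(1)
n <- 200
s <- sin(2 * pi * (1:n) / 50)                          # shared (correlated) signal
X <- cbind(S1 = s + rnorm(n, sd = 0.1),
           S2 = 2 * s + rnorm(n, sd = 0.1),
           S3 = rnorm(n))                              # mostly noise

pca  <- prcomp(X, center = TRUE, scale. = TRUE)
expl <- pca$sdev^2 / sum(pca$sdev^2)                   # variance explained per PC
d    <- which(cumsum(expl) >= 0.9)[1]                  # smallest d reaching alpha = 0.9
S_d  <- pca$x[, 1:d, drop = FALSE]                     # scores S_(d)  (Alt A)
X_rec <- S_d %*% t(pca$rotation[, 1:d, drop = FALSE])  # back-transform (Alt B)
```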
STL decomposition

- STL = seasonal and trend decomposition using LOESS (LOcal regrESSion)
- pronounced "Lo-Ess"
- robust decomposition technique
- seasonality may change over time
- additive decomposition of the time series: $x_t = s_t + \tau_t + r_t$
  - $s_t$ ... seasonality
  - $\tau_t$ ... trend
  - $r_t$ ... remainder (residuals, noise)
- Invented by Cleveland et al. (1990) (highly readable!)
- See also Hyndman and Athanasopoulos (2021, ch. 3.6)

STL decomposition: Choosing parameters

- Seasonality window: a required parameter, no default
  - must be odd
  - only one value per period is available to the seasonal smoother, which must be taken into account when choosing
- Trend window: the automatic choice is not always best
  - must be odd
  - applies to all points in a series
  - to span three years in a series with daily data, a 3 x 365 = 1095 wide window is needed
- R implementation: function stl()
  - requires a time series without missing data (a ts object)
  - takes the period from the frequency of the time series
- More info: see Cleveland et al. (1990) for a discussion of parameter choices
- Examples: see transformations_part2.Rmd
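A minimal sketch using stl() (base R, as referenced above) on the built-in monthly co2 series, which is a ts object with frequency 12 and no missing values; the window values are assumptions, chosen only to satisfy the oddness requirement.

```r
# Minimal sketch: STL decomposition with stl()
fit <- stl(co2, s.window = 13, t.window = 21)  # both windows odd, as required
plot(fit)                                      # panels: data, seasonal, trend, remainder
head(fit$time.series)                          # columns: seasonal, trend, remainder
```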
Fourier Transforms

Fourier Transforms for Periodic Processes

- Describe time-varying functions in frequency space
- Compose functions in time as sums of sine and cosine functions with different frequencies
- The weights (loadings) assigned to the different sines and cosines determine the shape of the composed function
- Widely used in signal processing, image analysis, etc.
- The Fast Fourier Transform (FFT) is the key algorithm
- Can be used in data analysis to identify (and remove) periodic components (see example)
- Combine with ARIMA to handle time series with periodicity (see Hyndman and Athanasopoulos (2021, chs. 7.4, 10.5))
- For technical hints, see my FFT_MATLAB.pdf note

Stochastic Processes

- A stochastic process is a sequence of random variables observed over time $t$: $\{X_t : t \in T\}$
- Stochastic process vs. time series:
  - as for time series, $T$ denotes the (continuous or discrete) time domain
  - unlike a single (observed) time series $(x_t)_{t \in T}$, $\{X_t : t \in T\}$ describes a full probability distribution at each time point $t \in T$
- a single observed time series $(x_t)_{t \in T}$ can be interpreted as a sample from a stochastic process $\{X_t : t \in T\}$ (a trajectory)
- major characteristics of stochastic processes include:
  - the mean function $\mu : T \to \mathbb{R}$, $\mu(t) = E(X_t)$
  - the (co)variance function $\sigma : T \times T \to \mathbb{R}$, $\sigma(t, t) = \mathrm{Var}(X_t)$ and $\sigma(t, t + \tau) = \mathrm{Cov}(X_t, X_{t+\tau})$

Stationarity

- A stochastic process $\{X_t : t \in T\}$ is (strictly) stationary if all its marginal distributions are equal over time, i.e.,
  $f_{X_{t_1}, X_{t_2}, \dots, X_{t_n}} = f_{X_{t_1+\tau}, X_{t_2+\tau}, \dots, X_{t_n+\tau}}$ for all $t_1, \dots, t_n \in T$ and $\tau \in \mathbb{R}_+$
- (Diagram: identical marginal distributions over $X_1, X_2, X_3, X_4, \dots$ along the time axis.)
- A stochastic process $\{X_t : t \in T\}$ is (weakly) stationary if
  - its expected value is constant over time, i.e., $E(X_t) = \mu$ for all $t \in T$, and
  - its autocovariance is constant over time, i.e., $\mathrm{Cov}(X_t, X_{t+\tau}) = \sigma^2_\tau$ for all $t \in T$ and all $\tau \in \mathbb{R}_+$ (it depends only on the lag $\tau$, not on $t$)

Gaussian White Noise

- the simplest stochastic process: the white noise process
  - mean $E(X_t) = 0$ for all $t \in T$
  - independence over time: $X_t, X_{t+\tau}$ independent for all $t \in T$, $\tau \in \mathbb{R}_+$
- special case: the Gaussian white noise process
  - $X_t \sim N(0, \sigma^2)$
  - corresponds to an ARIMA(0, 0, 0) model
- see Hyndman and Athanasopoulos (2021, ch. 2.9) and Mills (2019, ch. 3)

Forecasting models as stochastic processes

- ARIMA and ETS models represent stochastic processes
- ARIMA(0, 1, 0) is a random walk with variable step length
- Example: ARIMA(0, 1, 0) as a random walk (simulated below):
  $x_t - x_{t-1} = \varepsilon_t$, i.e., $x_t = x_{t-1} + \varepsilon_t$
  - $x_{t-1}$ ... previous position from step $t - 1$
  - $\varepsilon_t$ ... direction and length of step $t$
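A minimal simulation sketch of Gaussian white noise and the ARIMA(0, 1, 0) random walk obtained from it by cumulative summation; parameters are arbitrary illustrative choices.

```r
# Minimal sketch: Gaussian white noise and the random walk built from it
set.seed(1)
eps <- rnorm(300)        # X_t ~ N(0, 1), independent over time
rw  <- cumsum(eps)       # x_t = x_{t-1} + eps_t, i.e., ARIMA(0, 1, 0)

op <- par(mfrow = c(1, 2))
plot.ts(eps, main = "Gaussian white noise")
plot.ts(rw,  main = "Random walk")
par(op)
```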
Types of stochastic processes

- 4 major classes of stochastic processes:

                      discrete time       continuous time
  discrete values     discrete chain      discrete process
  continuous values   continuous chain    continuous process

  Table 2: Types of stochastic processes

- Gaussian white noise: continuous chain or process
- random walk: discrete chain
- ARIMA: continuous chain

Markov Chains

Markov Property

- a special property of stochastic processes: $\{X_t\}$ has the Markov property (of order 1) if
  $P(X_t \mid X_{t-1}, X_{t-2}, \dots, X_1) = P(X_t \mid X_{t-1})$ for any $t$
- all information about $X_t$ contained in the historical data is contained in $X_{t-1}$
- no additional information about $X_t$ is contained in $X_{t-2}, \dots, X_1$
- "$X_t$ is conditionally independent of $X_{t-2}, \dots, X_1$ given $X_{t-1}$"

Notations

- stochastic processes are random variables (capital letters) $X_t$:
  - moments $E(X_t)$, $\mathrm{var}(X_t)$, $\mathrm{cov}(X_t, X_{t+k})$, etc.
  - marginal distributions / densities $f_{X_t}(x)$
  - joint distributions / densities $f_{X_t, X_{t+k}}(x, y)$
- time series are observations (non-capital letters) $x_t$:
  - statistical quantities, $\mathrm{mean}(x)$, etc.
  - empirical distributions via histogram, kernel density estimate, etc.

Markov chain: Evolution in time

- Time-homogeneous Markov chains: a Markov chain is called time-homogeneous if its one-step transition probabilities are independent of $t$, i.e.,
  $P(X_{t+1} \mid X_t) = P(X_{s+1} \mid X_s)$ for all $s, t \in T$
- example: random walk

Markov chains over categorical variables

- States (categories): $X_t \in C$ for $C = \{c_1, \dots, c_n\}$ (just $1, \dots, n$ below for brevity)
- Transition matrix (time-homogeneous chain): $A_{ij} = P(X_{t+1} = j \mid X_t = i)$, read "from state $i$ to state $j$"
- State distribution vector: $\pi^{(t)} \in [0, 1]^n$ with $\pi_i^{(t)} = P(X_t = i)$
- Initial state distribution vector: $\pi^{(0)} \in [0, 1]^n$ with $\pi_i^{(0)} = P(X_0 = i)$
- The state vector contains probabilities of mutually exclusive events, and the Markov chain must be in one of these states at any step:
  $\sum_{i=1}^{n} \pi_i^{(t)} = \sum_{i=1}^{n} P(X_t = i) = 1$

Markov chain: Stochastic matrix

- Transition matrix $A_{ij} = P(X_{t+1} = j \mid X_t = i)$
- All $A_{ij}$ are probabilities, so $A_{ij} \in [0, 1]$
- Rows of $A$ contain probabilities of mutually exclusive transitions out of a given state, of which one must be chosen ("out of" here allows staying "in place"):
  $\sum_{j=1}^{n} A_{ij} = \sum_{j=1}^{n} P(X_{t+1} = j \mid X_t = i) = 1$
- $A$ is a (right) stochastic matrix (see Wikipedia)
- In line with Visser and Speekenbrink (2022) and Fink (2014), we will work with right stochastic matrices
- Thus, we treat $\pi$ as a row vector
- For a left stochastic matrix, rows and columns trade roles and $\pi$ would be a column vector

Time evolution of the state distribution

- At time $t$, the state distribution is $\pi^{(t)}$
- The probability of being in state $j$ at time $t + 1$ is then
  $\pi_j^{(t+1)} = \sum_{i=1}^{n} \pi_i^{(t)} A_{ij}$
- In matrix notation, with $\pi$ a row vector (the notation used by Visser and Speekenbrink (2022)):
  $\pi^{(t+1)} = \pi^{(t)} A$
- Note that we multiply from the right with the transition matrix, because it is a right stochastic matrix

Markov chain: time evolution

- The $k$-step transition probability $A^{(k)}$ can be computed via matrix multiplication:
  $P(X_{t+k} \mid X_t) = A^{(k)} = \underbrace{A \cdot A \cdots A}_{k \text{ times}} = A^k$
- The marginal probability distribution of the state $X_t$ at time $t$:
  $P(X_t) = \pi^{(t)} = \pi^{(0)} A^{(t)} = \pi^{(0)} A^t$

Estimating Markov model parameters from data

- We assume that we know the number of states $n$
- The Markov model is then given by $\pi^{(0)}$ and $A$
  - $\pi^{(0)}$ has $n$ elements; $A$ has $n^2$ elements
  - $\pi^{(0)}$ and all rows of $A$ are normalized (sum 1)
  - so there are $n - 1$ free parameters in $\pi^{(0)}$ and $n^2 - n = n(n-1)$ free parameters in $A$
  - in total $(n+1)(n-1) = n^2 - 1$ free parameters
- Use the maximum-likelihood method to estimate $\hat{\pi}^{(0)}$ and $\hat{A}$ given a time series $x_1, x_2, \dots, x_{t-1}, x_t, x_{t+1}, \dots$ (see the sketch below)
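A minimal sketch of the state-distribution evolution and the count-based maximum-likelihood estimate of A; the 2-state matrix and the simulated trajectory are toy assumptions.

```r
# Minimal sketch: a 2-state time-homogeneous chain (toy parameters)
A   <- rbind(c(0.9, 0.1),
             c(0.3, 0.7))          # right stochastic matrix: rows sum to 1
pi0 <- c(1, 0)                     # initial state distribution (row vector)

pi0 %*% A                                            # pi^(1) = pi^(0) A
k_step <- function(A, k) Reduce(`%*%`, replicate(k, A, simplify = FALSE))
pi0 %*% k_step(A, 10)                                # pi^(10) = pi^(0) A^10

# simulate a trajectory and estimate A by maximum likelihood (transition counts)
set.seed(1)
n_steps <- 5000
x <- integer(n_steps); x[1] <- 1
for (t in 2:n_steps) x[t] <- sample(1:2, 1, prob = A[x[t - 1], ])
counts <- table(from = x[-n_steps], to = x[-1])      # observed transitions
A_hat  <- counts / rowSums(counts)                   # row-normalized MLE of A
A_hat
```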
Mixture Models

- Data are often not distributed according to a "simple" distribution such as the normal
- But in certain cases one may have a mixture of multiple distributions
- Temperature data may be a weighted sum of normal distributions:
  $f(T) = \alpha_1 f_1(T) + \alpha_2 f_2(T) = \alpha_1 N(T; \mu_1, \sigma_1^2) + \alpha_2 N(T; \mu_2, \sigma_2^2)$
  with $\alpha_i \in [0, 1]$ and $\alpha_1 + \alpha_2 = 1$
- We need to estimate five parameters (why not six? the constraint fixes $\alpha_2 = 1 - \alpha_1$)
- See Visser and Speekenbrink (2022, chs. 2-3) for details and DAT320_mixture_model.Rmd for an example
- (Figure: distribution of mean daily temperatures at Ås, 1874-2023.)

Hidden Markov Models

- Parameters in a hidden Markov model: $\lambda = (A, B, \pi^{(0)})$
- $X_1, \dots, X_{t_{\max}}$ is a Markov chain of hidden states with parameters $(A, \pi^{(0)})$
- $Y_1, \dots, Y_{t_{\max}}$ are observable variables
- $Y_t$ depends on $X_t$ through the emission probabilities $B$
- (Diagram: the observable row $Y_1, \dots, Y_{t-1}, Y_t, Y_{t+1}, \dots$ is linked to the hidden row $X_1, \dots, X_{t-1}, X_t, X_{t+1}, \dots$ by $B$; transitions within the hidden row follow $A$, starting from $\pi^{(0)}$.)

Hidden Markov Model parameters

- Hidden Markov model $(A, B, \pi^{(0)})$
- Model parameters:
  - $n^2$ transition probabilities $A$ ($n^2 - n$ free)
  - $n$ initial state probabilities $\pi^{(0)}$ ($n - 1$ free)
  - $nm$ emission probabilities $B$ ($nm - n$ free)
  - total: $n(n + m + 1)$ parameters, of which $(n^2 - n) + (n - 1) + (nm - n) = n^2 + nm - n - 1$ are free
- Estimating the model parameters is not trivial
- We will look at several algorithms next week

HMM Estimation: What is the problem?

- The exposition follows Visser and Speekenbrink (2022, ch. 4)
- Time starts at $t = 1$
- $S_t$ represents the state of the Markov chain instead of $X_t$
- We assume for now that the observed variables may come from a continuous distribution
- We further express explicitly that the initial, transition and observation probabilities depend on the model parameters $\theta = (\theta_{pr}, \theta_{obs}, \theta_{tr})$
- The HMM is defined by:
  - initial state distribution $P(S_1 \mid \theta_{pr})$, corresponding to $\pi^{(1)}$
  - observation density $f(Y_t \mid S_t, \theta_{obs})$, corresponding to $B$
  - transition distribution $P(S_t \mid S_{t-1}, \theta_{tr})$, corresponding to $A$
- Probability of a specific sequence of observations and states:
  $f(Y_{1:T}, S_{1:T} \mid \theta) = P(S_1 \mid \theta_{pr}) \, f(Y_1 \mid S_1, \theta_{obs}) \prod_{t=2}^{T} P(S_t \mid S_{t-1}, \theta_{tr}) \, f(Y_t \mid S_t, \theta_{obs})$
- Probability of a specific sequence of observations:
  $f(Y_{1:T} \mid \theta) = \sum_{S_{1:T} \in S^T} f(Y_{1:T}, S_{1:T} \mid \theta)$
- Likelihood of the model for observations $y_{1:T}$: $L(\theta \mid y_{1:T}) = f(y_{1:T} \mid \theta)$
- Problem: the sum has $|S^T| = N^T$ terms
- Learning example: $N = 2$, $T = 49$ gives $N^T = 2^{49} \approx 5.6 \times 10^{14}$

Algorithms for HMMs

Hidden Markov Models: Estimation problems

- Evaluation: estimate the probability of observing $y_1, \dots, y_T$ given the HMM $(A, B, \pi^{(0)})$
  - example: choose a word by selecting the HMM with the highest probability
  - forward algorithm
- Decoding: estimate the hidden state sequence $x_1, \dots, x_t$ given the HMM parameters $(A, B, \pi^{(0)})$ and observations $y_1, \dots, y_t$
  - example: determine when test persons mastered the task
  - Viterbi algorithm
- Training: estimate the HMM parameters $(A, B, \pi^{(0)})$ given observations $y_1, \dots, y_T$
  - infer the complete hidden Markov model from observations
  - forward-backward algorithm (Baum-Welch algorithm), also known as a generalized expectation maximization (EM) algorithm
- See Visser and Speekenbrink (2022, ch. 4), Fink (2014, ch. 5), Rabiner (1989), and for a historical account Liang (2023)

Forward algorithm: States involved in subsequent steps

- $\alpha_i^{(1)} = f(Y_1 = y_1, S_1 = i \mid \theta)$
- $\alpha_i^{(2)} = f(Y_1 = y_1, Y_2 = y_2, S_2 = i \mid \theta)$
- $\alpha_i^{(t)} = f(Y_1 = y_1, \dots, Y_t = y_t, S_t = i \mid \theta)$

Backward algorithm: states involved

- $\beta_i^{(T)} = 1$
- $\beta_i^{(T-1)} = P(Y_T = y_T \mid S_{T-1} = i, \theta)$
- $\beta_i^{(t)} = P(Y_T = y_T, \dots, Y_{t+1} = y_{t+1} \mid S_t = i, \theta)$

Baum-Welch algorithm

- (Diagram: the forward quantities $\alpha^{(t)}$, backward quantities $\beta^{(t)}$ and state posteriors $\gamma^{(t)}$ over the observable row $Y_1, \dots, Y_T$ and hidden row $X_1, \dots, X_T$, connected by $B$ and $A$.)

Overview of HMM algorithms

- Training: Baum-Welch algorithm (observations $y_1, \dots, y_T$ to parameters $(A, B, \pi^{(0)})$)
- Prediction/decoding: Viterbi algorithm (observations and parameters to states $x_1, \dots, x_T$)

Gaussian Hidden Markov Models

- The observables $Y_t$ may also be continuous, e.g. (Gaussian hidden Markov models):
  $P(Y_t \mid X_t = i) = N(\mu_i, \sigma_i^2)$
- In this case, $B$ is replaced by class-wise parameter pairs $\{\mu_1, \sigma_1^2\}, \{\mu_2, \sigma_2^2\}, \dots, \{\mu_m, \sigma_m^2\}$
- (Figure: the emission densities of $Y_t \mid X_t = 1$ and $Y_t \mid X_t = 2$ over the observed value axis.)
- A forward-algorithm sketch for this Gaussian case follows below.
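A minimal from-scratch sketch (not any package interface) of a scaled forward algorithm for a Gaussian HMM: it evaluates $\log f(y_{1:T} \mid \theta)$ in $O(Tn^2)$ operations instead of the naive $N^T$-term sum, rescaling each step to avoid numeric underflow. All parameter values are toy assumptions.

```r
# Minimal sketch: scaled forward algorithm for a Gaussian HMM
forward_gauss <- function(y, A, pi0, mu, sigma) {
  n <- length(pi0); TT <- length(y)
  alpha <- matrix(0, TT, n)                             # scaled alpha_i(t)
  scl   <- numeric(TT)                                  # per-step scale factors
  alpha[1, ] <- pi0 * dnorm(y[1], mu, sigma)            # alpha_i(1)
  scl[1] <- sum(alpha[1, ]); alpha[1, ] <- alpha[1, ] / scl[1]
  for (t in 2:TT) {                                     # alpha(t) = (alpha(t-1) A) * b(y_t)
    alpha[t, ] <- (alpha[t - 1, ] %*% A) * dnorm(y[t], mu, sigma)
    scl[t] <- sum(alpha[t, ]); alpha[t, ] <- alpha[t, ] / scl[t]
  }
  list(alpha = alpha, logLik = sum(log(scl)))           # log f(y_1:T | theta)
}

# toy 2-state example with assumed parameters
A   <- rbind(c(0.95, 0.05), c(0.10, 0.90))
pi0 <- c(0.5, 0.5); mu <- c(0, 3); sigma <- c(1, 1)
set.seed(1)
y <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))        # two regimes
forward_gauss(y, A, pi0, mu, sigma)$logLik
```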
Classification & Clustering

- Classification & clustering problems in time series data:
  - comparing time series segments to each other (clustering)
  - modeling a (time-independent) target variable from a time series segment (classification)
  - time series segmentation & change-point detection
- See Maharaj et al. (2019), Aghabozorgi et al. (2015), and Montero and Vilar (2014)
- Approaches for time series clustering and classification:
  - distance-based
  - feature-based
  - model-based

Distance measures

- How to quantify "distance" or "similarity"?
- Recap: distance metrics. In mathematics, a metric on $\mathbb{R}^n$ is a function $d(\cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_{\ge 0}$ which fulfills the following conditions for all $x, y, z \in \mathbb{R}^n$:
  - $d(x, x) = 0$
  - $d(x, y) > 0$ for all $y \ne x$
  - $d(x, y) = d(y, x)$ (symmetry)
  - $d(x, y) + d(y, z) \ge d(x, z)$ (triangle inequality)

Distance measures between time series

- Specific distance measures for time series are based on:
  - Dynamic Time Warping (DTW)
  - correlation & autocorrelation
  - time series models (e.g., ARIMA)
  - frequency-domain representations (feature-based methods)
- Note: some of these measures violate the assumptions of proper distance metrics. For example, two time series may have distance zero although they do not match exactly.
- Example: distance measures in R

Clustering of time series

Hierarchical clustering

- Requires the pair-wise distance matrix between all objects:
  $D = \left( d(x_T^{(i)}, x_T^{(j)}) \right)_{i,j}$
- Bottom-up approach: agglomerative clustering
  - initialize with one cluster per object
  - in each step, combine the clusters with the "shortest" inter-cluster distance
- Top-down approach: divisive clustering
  - initialize with one cluster containing all objects
  - in each step, divide the cluster with the "largest" intra-cluster distance

Distance-based clustering

- $d(\cdot, \cdot)$ quantifies distances between objects
- How to quantify distances between clusters $C_1$ and $C_2$?
  - single linkage: $d(C_1, C_2) = \min_{x_T \in C_1, y_T \in C_2} d(x_T, y_T)$
  - complete linkage: $d(C_1, C_2) = \max_{x_T \in C_1, y_T \in C_2} d(x_T, y_T)$
  - (weighted) average linkage, median linkage, etc.

Dendrogram

- Tree-based representation of hierarchical clustering results

Cluster evaluation

- Internal evaluation (no "ground truth" information):
  - Dunn index: minimum inter-cluster distance vs. maximum intra-cluster distance
  - silhouette coefficient: average distance among objects in the same cluster vs. average distance between objects in distinct clusters
- External evaluation (compares to "ground truth" information):
  - purity: number of elements belonging to the same ground-truth class in each cluster
  - Rand index: similarity to benchmark classifications, related to accuracy
  - confusion matrix

Time series features

- What characterizes a time series?
  - global statistics (e.g., mean, standard deviation)
  - autocorrelation, partial autocorrelation
  - model parameters (e.g., ARIMA)
  - properties related to seasonalities / periodicities (e.g., frequencies)
  - occurrence of particular shapes / patterns
- Features can be extracted from:
  - the original time series
  - a transformed version of the time series (e.g., differenced, Fourier transformed, etc.)

Classification algorithms

- General classification algorithms:
  - logistic regression
  - decision tree / random forest
  - k-nearest neighbors (kNN)
  - support vector machine
  - Bayes classifier
- Time series aspects:
  - a time series distance measure (e.g., Dynamic Time Warping with kNN)
  - a set of time series features (e.g., wavelet features)
  - specialized variants of classifiers for time series data (e.g., time series forest)

Evaluation metrics

Table 2: Evaluation metrics for binary classification (computed in the sketch below):

- accuracy: $\mathrm{acc} = \frac{TP + TN}{TP + FP + FN + TN}$
- precision: $\mathrm{pre} = \frac{TP}{TP + FP}$
- recall: $\mathrm{rec} = \frac{TP}{TP + FN}$
- F1 score: $F_1 = \frac{2\,TP}{2\,TP + FP + FN} = 2 \cdot \frac{\mathrm{pre} \cdot \mathrm{rec}}{\mathrm{pre} + \mathrm{rec}}$
- MCC: $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
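A minimal sketch computing the metrics in Table 2 from confusion-matrix counts; the counts passed in at the end are hypothetical.

```r
# Minimal sketch: binary classification metrics from confusion counts
metrics <- function(TP, FP, FN, TN) {
  pre <- TP / (TP + FP)
  rec <- TP / (TP + FN)
  c(accuracy  = (TP + TN) / (TP + FP + FN + TN),
    precision = pre,
    recall    = rec,
    F1        = 2 * TP / (2 * TP + FP + FN),   # = 2 * pre * rec / (pre + rec)
    MCC       = (TP * TN - FP * FN) /
                sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
}
metrics(TP = 40, FP = 10, FN = 5, TN = 45)     # hypothetical counts
```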