Intuitions on Time Series Data
Document Details
Uploaded by PatientPermutation
Summary
This presentation provides an overview of time series data, covering concepts such as univariate and multivariate time series, preprocessing, and different modeling approaches. It explains important characteristics of time series, such as trend and seasonality, and how to handle outliers and missing values. It also details the steps involved in applying time series models.
Full Transcript
Intuitions on Time Series Data
CSMODEL

Time Series
A time series is a set of observations taken at specified times, usually at equally spaced intervals. Each observation in the dataset is associated with time information.
Although time series data can be modelled using other techniques, those techniques will not be able to capture the temporal nature of the data.
Classifications of time series data:
- Univariate time series: analyzes only a single variable observed across different points in time.
- Multivariate time series: analyzes multiple variables observed across different points in time.

Univariate Time Series
Example: temperature data measured in a city over time. In this case, the single variable being analyzed over time is the temperature in Fahrenheit.

Multivariate Time Series
Example: EEG data (brain waves) contains multiple channels whose signals are measured continuously across the session.

Preprocessing
Time series data is often organized in tabular form.

Month    # of Passengers
1949-01  112
1949-02  118
1949-03  132
1949-04  129
1949-05  121
1949-06  135
1949-07  148
1949-08  114

The column Month represents the time of the observation. The column # of Passengers is the variable that we want to analyze as a time series.
Make sure that time is represented in an appropriate format. For example, convert the string representation of the column Month to a datetime object. Computers cannot infer sequential information from strings, so it is important to always represent time information as time objects.
It is also a good idea to make the time information the index of the DataFrame. For example, assign the column Month as the index of this DataFrame. This way, desired observations can be selected easily using their indices.
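
The preprocessing steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the course's own code: the DataFrame is typed in by hand from the table in the slides, and the variable names `df`, `march`, and `spring` are assumptions made for the example.

```python
import pandas as pd

# Hand-typed copy of the table from the slides (assumed for illustration).
df = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03", "1949-04",
              "1949-05", "1949-06", "1949-07", "1949-08"],
    "# of Passengers": [112, 118, 132, 129, 121, 135, 148, 114],
})

# Convert the string column to datetime objects so the ordering is known.
df["Month"] = pd.to_datetime(df["Month"], format="%Y-%m")

# Use the time information as the index of the DataFrame.
df = df.set_index("Month")

# Select observations through the time index: one month, then a range.
march = df.loc[pd.Timestamp("1949-03-01"), "# of Passengers"]
spring = df.loc["1949-03":"1949-05", "# of Passengers"]

print(march)
print(spring.tolist())
```

With a datetime index, a range of months can be selected with a plain slice, which would not work reliably if the months were still stored as strings in an ordinary column.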

Plotting
The simplest way to visualize univariate time series data is to plot it as a line graph, with time on the x-axis and the variable to be analyzed on the y-axis.

Considerations
- Time series data is (usually) not independent. One data point is likely influenced by the surrounding data. For example, in a series of peso exchange rates, each observation is influenced by the exchange rate in the previous years.
- The data points are usually not identically distributed. The means and distributions for different time periods are often not the same.
- The ordering (sequence) matters. Changing the order of observations in a time series dataset results in an analysis with a different meaning.

Time Series Analysis
The goal of time series analysis is to find a mathematical model that captures the pattern of the time series. The model can be used to:
- Describe the features of the time series.
- Explain the interaction between different components of the time series.
- Forecast the future values of the series.

Characteristics
- Trend: shows the general tendency of the data to increase or decrease over time (i.e., on average, do the measurements tend to increase or decrease over time?).
- Seasonality: refers to periodic fluctuations that repeat over time.
- Outliers: refers to unusual or extreme observations in the data.
- Missing values: refers to missing observations in the dataset.

Handling Outliers
Outliers can have different causes:
- They may be caused by errors in measurement or errors in the data collection process.
- They may be actual anomalies that really occurred in the real world.
When dealing with outliers, be careful to consider why they have occurred.

Data Analysis
Questions regarding outliers and anomalies: Is there anything strange with this data?
Maltz, M. D. (2010).
Look before you analyze: Visualizing data in criminal justice. In Piquero, A. and Weisburd, D., editors, Handbook of Quantitative Criminology, chapter 3, pages 25-52. Springer New York, New York, NY.
Data: Sexual Assault data in Boulder, CO (1960-2004)

Data Analysis
Questions regarding outliers and anomalies: Is there anything strange with this data?
Data: Murder Data in Oklahoma City (1960-2004), also from Maltz (2010).

Handling Outliers
One way to handle outliers in a time series is to remove them, converting them to missing values.

Handling Missing Values
Handle missing values using imputation. Imputation refers to estimating the missing values based on the other values that are available. There are several techniques for imputation, some of which are:
- Copying the value of the nearest data point
- Linear interpolation

Time Series Example
This is a time series plot of the annual number of earthquakes in the world with seismic magnitude over 7.0.
- There is no consistent trend.
- There is no seasonality.
- There appear to be no missing values or outliers.

Time Series Example
This is a time series plot of the quarterly production of beer in Australia for 18 years.
- There is an upward trend.
- There is seasonality.
- There appear to be no missing values or outliers.

Autocorrelation
Autocorrelation measures the degree of similarity of a time series with a lagged version of itself.
[Figure: original series (blue) and the series lagged by 3 (orange)]
If we set the lag to 12, then the seasonality is revealed.
[Figure: original series (blue) and the series lagged by 12 (orange)]
The autocorrelation plot shows the autocorrelation values (y-axis: correlation) for different lag values. As expected, autocorrelation is high at around a lag of 12.
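
To make the idea of comparing a series with its lagged copy concrete, here is a small sketch of a commonly used sample autocorrelation formula. The function name `autocorrelation` and the synthetic sine-wave series are assumptions for illustration, not data from the slides, and library implementations may normalize slightly differently.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    n = len(series)
    mean = sum(series) / n
    # Total variation of the series around its mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between the series and its copy shifted by `lag`.
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

# Synthetic series (assumed): a pattern that repeats every 12 time steps.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]

for lag in (3, 6, 12):
    print(lag, round(autocorrelation(series, lag), 2))
```

For this synthetic series the value at a lag of 12 is close to 1, the value at a lag of 6 is strongly negative (the lagged copy is half a cycle out of step), and the value at a lag of 3 is close to 0.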
High autocorrelation at a lag of 12 means that there is a season that recurs every 12 time steps (every 12 years if the observations are annual, or every 12 months if they are monthly).

Stationarity
Data is stationary if there is no trend and no seasonality in the time series.
- The mean, variance, and autocorrelation remain relatively constant over time.
- Its properties do not depend on the time at which the series is observed.
- It has no predictable pattern in the long term.
- The plot of a stationary time series is roughly horizontal (with some cyclic behavior) with constant variance.
[Figure: three example series, labeled Seasonal, Has Trend, and Stationary]

The Dickey-Fuller test is used to test whether a process is stationary. This test returns a p-value. If the p-value is low (< 0.05), the null hypothesis of non-stationarity is rejected, and the time series can be treated as more or less stationary. An updated version, the Augmented Dickey-Fuller test, accommodates larger and more complex data.

It is important to transform time series data into a stationary time series before modelling. Why?
- Most time series models assume that each point is independent of one another.
- The best indication of this is when the dataset of past instances is stationary.
Techniques to make the data more stationary:
- Differencing
- Residual modelling
- Log transform

Differencing
Differencing means subtracting from each data point the data point before it. Given the time series $Z_t$, we create a new time series:
$$Y_t = Z_t - Z_{t-1}$$
Differencing is used to stabilize the mean of a time series.
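
As a worked instance of the differencing formula, the sketch below applies it to the eight passenger counts from the table earlier in the slides and to a made-up series with a perfectly linear trend. The function name `difference` and the second series are assumptions for illustration.

```python
def difference(series):
    """First-order differencing: Y_t = Z_t - Z_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

# Passenger counts from the table in the slides (1949-01 to 1949-08).
z = [112, 118, 132, 129, 121, 135, 148, 114]
y = difference(z)
print(y)

# Made-up series with a linear trend: its differences are constant,
# so the trend no longer appears in the differenced series.
trend = [10, 12, 14, 16, 18]
print(difference(trend))
```

Note that the differenced series is one observation shorter than the original, because the first data point has no predecessor to subtract.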
Differencing may establish stationarity.
Stationarity: the mean and variance remain relatively constant over time.
Source: https://towardsdatascience.com/why-does-stationarity-matter-in-time-series-analysis-e2fb7be74454
In the example from the same source, the data is still not stationary after (first-order) differencing.
Source: https://towardsdatascience.com/why-does-stationarity-matter-in-time-series-analysis-e2fb7be74454

Residual Modelling
Residual modelling means fitting a line or a curve to model the time series, and then using the residuals as the new data points. A residual is the difference between the observed value and the estimated value. Residual modelling is used to eliminate trend in the time series.

Log Transform
Log transform means applying a logarithm operation to each data point. This helps stabilize the variance, as it reduces the impact of higher values.
Differencing may then be applied to the log-transformed data. The mean and variance then level out, and the series shows no signs of trend or strong seasonality.
Source: https://towardsdatascience.com/why-does-stationarity-matter-in-time-series-analysis-e2fb7be74454

Stationarity
In many cases, the process of transforming time series data into a stationary series is abstracted from you (e.g., by Python functions). However, you should still be familiar with the concept of stationarity and the operations behind it if you plan to delve deeper into time series analysis.

Time Series Modelling
Goal: Represent time series data with a model.
Several approaches for time series modelling:
- Moving average
- MA (moving average) model
- AR (autoregressive) model
- ARIMA (autoregressive integrated moving average) model

Moving Average
The most straightforward (naïve) approach to time series modelling. The model states that the next observation is the average of all previous observations. If we use this for forecasting, all future values will be the same as the most recent forecast, because averaging in a value that already equals the mean leaves the mean unchanged.
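
The behavior of this naïve approach can be checked with a few lines of Python. This is only a sketch: the function name `mean_forecast` is made up for the example, and the history reuses the passenger counts from the table earlier in the slides.

```python
def mean_forecast(history, steps):
    """Forecast `steps` values, each as the average of everything before it."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = sum(values) / len(values)
        forecasts.append(next_value)
        # Treat the forecast as if it were observed, then forecast again.
        values.append(next_value)
    return forecasts

history = [112, 118, 132, 129, 121, 135, 148, 114]
forecasts = mean_forecast(history, 3)
print(forecasts)
```

Every forecast equals the mean of the observed history, which is why this approach produces a flat line when forecasting several steps ahead.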
The concept of a window is used for the moving average. A window consists of $n$ observations. In moving average forecasting, the next observation is the mean of the previous $n$ observations.
The smoothness of the curve can be controlled with the sliding window: the wider the window, the smoother the curve.
The moving average can be used to describe the overall trend of the time series.

Moving Average (MA) Model
Main idea: Predict the value of the time series based on the errors of the previous values.
- Use a regression model on the errors of the previous values of the same data.
- Requires that the data is stationary, since the data points need to be identically distributed for regression to work.
MA(1) model:
$$y_t = \mu + \phi_1 \epsilon_{t-1} + \epsilon_t$$
MA(2) model:
$$y_t = \mu + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \epsilon_t$$
Here $\epsilon_t$ is the error at time $t$. The coefficients can be fitted to minimize the error.
[Figure: original series (blue) and MA(2) model (red)]

Autoregressive (AR) Model
Main idea: Predict the value of the time series at a given point based on the previous values of the time series.
- Use a regression model on the previous values of the same data.
- Requires that the data is stationary, since the data points need to be identically distributed for regression to work.
AR(1) model:
$$y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t$$
AR(2) model:
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \epsilon_t$$
The coefficients can be fitted to minimize the error.
[Figure: original series (blue) and AR(2) model (red)]

Autoregressive Integrated Moving Average (ARIMA) Model
A combination of the AR and MA models, plus an extra "differencing" step to remove the trend.
First, differencing is performed to remove the trend.
Then, a regression model combining MA and AR is used, where we assume that both AR and MA use a lag of 1 as their parameter.
$$y_t = \underbrace{\phi_1 y_{t-1}}_{\text{AR}} + \underbrace{\theta_1 \epsilon_{t-1}}_{\text{MA}} + \epsilon_t$$
Here $y_t$ is the differenced series. When forecasting, bring back the trend after making the prediction.

Summary
- It is often of interest to capture the temporal nature of datasets involving time in our data models.
- We can perform operations to transform our time series data to eliminate trend and seasonality, so that we can focus on the core pattern of the time series.
- There are models for time series data that allow us to describe its features and to forecast.
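
As a closing illustration of the "difference, model, bring back the trend" sequence described for ARIMA, the sketch below uses only the differencing and AR(1) parts. It is a deliberate simplification: the MA term is left out, because the past errors are not observed directly and fitting them needs an iterative estimation procedure. The function names `fit_ar1` and `forecast_next` and the toy series are assumptions for the example.

```python
def fit_ar1(y):
    """Least-squares fit of y_t = b0 + b1 * y_{t-1}; returns (b0, b1)."""
    x = y[:-1]       # previous values
    target = y[1:]   # values to predict
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(target) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, target))
    b1 = sxy / sxx
    b0 = mean_t - b1 * mean_x
    return b0, b1

def forecast_next(z):
    """One-step forecast: difference, fit AR(1), then undo the differencing."""
    y = [z[t] - z[t - 1] for t in range(1, len(z))]  # remove the trend
    b0, b1 = fit_ar1(y)                              # AR part only, no MA term
    next_diff = b0 + b1 * y[-1]                      # predicted next change
    return z[-1] + next_diff                         # bring back the trend

# Toy series (assumed): rising overall, with changes alternating +3, +1.
z = [10, 13, 14, 17, 18, 21, 22]
prediction = forecast_next(z)
print(prediction)
```

The last observed change in the toy series is +1, so the fitted model predicts a change of about +3 next, and adding that change to the last observation gives a forecast of about 25. A full ARIMA fit would normally be left to a statistics library.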