Neural Networks - NN for Time Series Analysis PDF
Summary
This document presents an outline for a lecture on neural networks and their application to time series analysis. The lecture covers various domains, analysis tasks, and forecasting methods.
Full Transcript
Neural Networks - NN for Time Series Analysis

Outline
1) Time Series Analysis Domains
2) Analysis Tasks and Metrics
3) Typical Time Series EDA and Preprocessing
4) Approaches for Forecasting
5) Approaches for Classification, Anomaly Detection

1) Time Series Analysis Domains

Economics and Finance
● GDP, unemployment rates, interest rates, stock prices

Healthcare
● Long-term patient monitoring to predict disease outcomes
● Predict disease outbreaks
● Predict healthcare resource needs

Industrial settings
● Utilities management: e.g. predict the need for electricity depending on weather patterns or on the productivity of green energy producers, predict water consumption
  [Figure: predicted forecast of water demand in London. Example on towardsdatascience.com]
● Telecommunication: detect data traffic outliers, detect network incidents, monitor network performance (service delivery)
● Transportation: traffic prediction, route optimization, predict charging station occupancy in a city
  (Ma and Faye. Multistep electric vehicle charging station occupancy prediction using hybrid LSTM neural networks. Energy, Vol. 244, 2022)

Environmental Science
● Analyze the impact of climate change, predict animal population sizes, analyze levels of pollution
  (Kumar et al. Nature Reports. The influence of solar-modulated regional circulations and galactic cosmic rays on global cloud distribution, 2023)

2) Analysis Tasks and Metrics

Forecasting: predict future values or trends in a sequence of data points based on past observations.
Challenges:
● Complexity of Temporal Patterns: patterns can be intricate, including multiple seasonalities (e.g. day level, month level), trends and irregularities
● Data Quality and Missing Values: noisy data, need for imputation techniques
● Non-Stationarity: most real-world datasets exhibit changing statistical properties over time
● Effect of external factors: forecasting becomes more challenging when the series is not a simple auto-regressive process, e.g. under the influence of policy decisions or economic changes

Forecasting – Metrics
● Mean Absolute Error (MAE):
  $$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$$
  – Advantages: easy to compute, less sensitive to outliers than MSE
  – Disadvantages: does not distinguish between over- and under-estimation, hides the magnitude of individual errors, penalizes large errors less than MSE
● Mean Squared Error (MSE):
  $$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
  – Advantages: captures both large and small errors, penalizes large errors
  – Disadvantages: strong penalties when outliers are present, hard to interpret since the units of measurement are squared
● Root Mean Squared Error (RMSE):
  $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$
  – Advantages: still penalizes large errors, is interpretable (same units as the data)
  – Disadvantages: still sensitive to outliers, does not distinguish between over- and under-estimation
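The three error metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the lecture material; the function names and the toy values are assumptions chosen for the example.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

def mse(y_true, y_pred):
    """Mean Squared Error: average squared deviation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Toy values (hypothetical): errors are -1, 0, 2, 1
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mae(y_true, y_pred))   # 1.0
print(mse(y_true, y_pred))   # 1.5
print(rmse(y_true, y_pred))  # sqrt(1.5), about 1.2247

# Same forecast with one large error (10 instead of 1) on the last point
y_outlier = [2.0, 5.0, 4.0, 17.0]
print(mae(y_true, y_outlier))  # 3.25
print(mse(y_true, y_outlier))  # 26.25
```

In this toy case the single large error raises MAE from 1.0 to 3.25 but MSE from 1.5 to 26.25, which illustrates why MSE and RMSE are described as more sensitive to outliers.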
Forecasting – Metrics (continued)
● Mean Absolute Percentage Error (MAPE):
  $$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100$$
  – Advantages: easy to compare across datasets and to interpret
  – Disadvantages: sensitive to extreme values, cannot handle 0 values in the ground truth, under- and over-estimation give different metric results => not symmetric
● Symmetric Mean Absolute Percentage Error (SMAPE):
  $$\mathrm{SMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{(|\hat{y}_i| + |y_i|)/2} \times 100$$
  – Advantages: is symmetric, can handle 0 values in the ground truth
  – Disadvantages: sensitive to extreme outliers, undefined (division by zero) if prediction and ground truth are both 0
● Mean Absolute Scaled Error (MASE):
  $$\mathrm{MASE} = \frac{\mathrm{MAE}}{\frac{1}{n-1} \sum_{i=2}^{n} |y_i - y_{i-1}|}$$
  – Advantages: scale independent → can compare forecast accuracies across different time series with varying scales, robust against outliers
  – Disadvantages: sensitive to 0 values in the denominator; assumes the naive model (repeating the last value of the observed series for any future prediction) is a relevant and accurate reference

Classification: categorize or label time-ordered sequences into predefined categories. This task is often encountered in healthcare (e.g. patient monitoring, activity classification), manufacturing (e.g. fault detection), or finance (e.g. fraud detection).

Challenges:
● Variable-length sequences
● Capturing the meaningful temporal dynamics
● Transferability across individuals: variability in the characteristics of the same class across the time series of different individuals
● Transferability across domains: models trained on one type of time series may not generalize well to other domains
● Handling missing data → imputation requirements

Classification Metrics: the typical ones (e.g. accuracy, precision, recall, F1, ROC-AUC)
● Specifics in how TP, FP, TN and FN are computed for interval-based detections
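Returning to the percentage-based and scaled forecasting metrics (MAPE, SMAPE, MASE) defined earlier in this section, the sketch below shows one way to compute them with NumPy. The function names and toy values are assumptions. For MASE, the sketch assumes the naive one-step error in the denominator is computed on a separate in-sample (training) series, passed here as the hypothetical argument `y_insample`; the slide formula does not state which series is used.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (breaks down if y_true contains 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100)

def smape(y_true, y_pred):
    """Symmetric MAPE (undefined where y_true and y_pred are both 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_pred) + np.abs(y_true)) / 2
    return float(np.mean(np.abs(y_pred - y_true) / denom) * 100)

def mase(y_true, y_pred, y_insample):
    """MAE of the forecast scaled by the MAE of the naive one-step forecast.

    Assumption: the naive error is measured on the in-sample series.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_insample, dtype=float))))
    return float(np.mean(np.abs(y_pred - y_true)) / naive_mae)

# Same absolute error of 50, but MAPE depends on which value is the ground truth
print(mape([100.0], [150.0]))   # 50.0
print(mape([150.0], [100.0]))   # about 33.33
# SMAPE gives the same value in both directions
print(smape([100.0], [150.0]))  # about 40.0
print(smape([150.0], [100.0]))  # about 40.0

# MASE: naive in-sample error is mean(|1|, |2|, |3|) = 2, forecast MAE is 1.5
print(mase([8.0, 9.0], [10.0, 8.0], y_insample=[1.0, 2.0, 4.0, 7.0]))  # 0.75
```

A MASE below 1 means the forecast has a smaller average error than the naive last-value forecast had on the in-sample data.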
Anomaly Detection and Changepoint Detection
● Anomaly Detection: identify outliers that represent significantly deviant events or significant changes in the underlying data distribution
● Changepoint Detection: identify points or periods in a time series where the underlying data distribution undergoes a significant shift

Challenges:
● Imbalanced datasets: anomalous behavior is far less frequent than “normal” behavior
● Dynamic Nature: anomalies can evolve over time
● Seasonality, Trends, Non-Stationarity: anomalies and changepoints have to be distinguished from seasonal or trend variations
● Noise and Outliers: need to distinguish between random noise and meaningful outliers

Anomaly Detection and Changepoint Detection Metrics
[Figure not included in transcript]

Benchmark datasets
● Forecasting:
  – Long term: Electricity Transformer Dataset (ETT), Traffic (PeMS), Respiratory Illness Monitoring (ILI), Weather (WetterStation)
  – Short term: M4 benchmark (collection of 100k time series from diverse domains)
● Classification: UEA benchmark
● Anomaly Detection: Server Machine Dataset (SMD), NASA Anomaly Detection Datasets (SMAP, MSL), Secure Water Treatment (SWaT)
● Changepoint Detection: Skoltech Anomaly Benchmark (30+ datasets), Time Series Segmentation Benchmark (75 annotated time series with 1-9 segments)

3) Typical Time Series EDA and Preprocessing
Smoothing methods
● Typically used to remove noise and transient outliers
● => apply smoothing filters:
  – Moving Average: compute the average of neighbouring points in a specified window; captures long-term trends
  – Median Filter: replace each point with the median value of the window around it → smooths time series with impulse noise or very sharp spikes
  – Exponential Smoothing:
    $$\hat{Y}_t = \alpha \left( Y_t + \sum_{i=1}^{r} (1 - \alpha)^i \, Y_{t-i} \right)$$
    ● Useful for capturing short-term trends
    ● Holt’s method: variation of exponential smoothing for when a trend is present
    ● Holt-Winters method: variation for trend + seasonality
  – Savitzky-Golay Filter: fit a polynomial to a sliding window of data points and use the polynomial to re-estimate the points; removes noise while maintaining the features of the time series

Missing value imputation
● Constant imputation: not really recommended, unless the missingness has to be a category (with a value) in itself
● Last Observation Carried Forward or Next Observation Carried Backward: replace missing values with the immediately preceding or subsequent observation → can be used when the time series is stationary
● Mean / Median / Mode imputation: replace missing values with the mean, median or mode of the available values (note: could underestimate variance)
● Rolling statistic imputation: apply mean / median / mode imputation over a specified window (note: window size is highly dependent on the application and the time series features)
● Linear interpolation: assumes a linear relation between time series values
● Spline interpolation: locally interpolate using low-degree polynomials (note: assumes smoothness of the time series values)
● k-NN imputation: replace missing values based on the values of the k nearest neighbours.
(Note: can be computationally intensive when the dataset is very large)
● STL Decomposition: break down the time series into seasonal, trend and residual components; impute values in the residuals, then reconstruct the series

Seasonal and Trend Decomposition using Loess (STL)
● Allows estimation of models for the seasonal, trend and residual components independently
● Two main parameters:
  – Trend window: controls how rapidly the trend-cycle can change
  – Seasonal window: controls how rapidly the seasonal cycle can change; a value of +inf indicates periodic, identical recurrence (i.e. the seasonal component is the same throughout the time series)
● Additive or multiplicative seasonality: additive is assumed by default; multiplicative is used when the seasonal change is assumed to have a non-linear influence on the series values

Stabilize Variance in Data
● If using an STL decomposition, variance stabilization is also required
● Obtained through a transformation function. Examples:
  – Log transform: can be applied to positive data (or to non-negative data if 1 is added before taking the log)
  – Box-Cox transform: a family of power transforms that includes the log transform as a special case; the standard form requires positive values, so negative values need a shifted (two-parameter) variant

Trend and Mean Normalization
● Normalization is useful (even required) when the numeric values in the time series are large and when the prediction model uses non-linear functions that saturate easily (e.g. sigmoid)
● Trend Normalization: if the input and output windows are deseasonalized, subtract the trend value of the last item in the input sequence from both input and output
● Mean Normalization: when STL is not performed, scale the whole series by the mean of that series

4) Approaches for Forecasting

ARIMA
ARIMA = Auto Regressive Integrated Moving Average (a.k.a.
Box-Jenkins method)
● A very good baseline (must beat)

AR (Auto Regressive) component: attempt to predict future values based on past values. AR requires the series to be stationary.
MA (Moving Average) component: attempt to predict future values based on past forecasting errors. This assumes that an AR model can approximate the series.

AR(p) model
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t$$
● p: the order parameter, i.e. how many prior steps to use in the regression
● $\phi_1, \phi_2, \dots, \phi_p$: coefficients to be estimated

MA(q) model
$$y_t = c + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t = c + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$
● q: the order parameter, i.e. how many prior white noise error terms to use in the regression
● $\epsilon_{t-j}$: a previous white noise error term
● $\theta_j$: coefficients to be estimated

An ARMA model expects the time series to be stationary.

I (Integrated) model component
● Procedure to make a series stationary through d-th degree differencing
● Differencing:
  – 1st degree: $y'_t = y_t - y_{t-1}$
  – 2nd degree: $y''_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})$

How to choose the appropriate hyperparameters p (AR), q (MA) and d (I)?
● Using (partial) Auto Correlation Function plots
● By minimising information criteria such as:
  – Akaike Information Criterion (AIC)
  – Bayesian Information Criterion (BIC)
  – Note: AIC and BIC are not usually used to fit the models, but rather to select among a set of already fitted candidate models

Using (partial) Auto Correlation Function plots
ACF = correlation between observations separated by k time steps
PACF = correlation between observations k steps apart, after accounting for a linear combination of the intermediate lags

Rules of Thumb:
● If the ACF plot shows a sharp cutoff and/or the lag-1 autocorrelation is negative → consider adding an MA term.
● If the PACF plot shows a sharp cutoff and/or the lag-1 autocorrelation is positive → consider adding an AR term to the model.

RNN models

RNN for Time Series Forecasting benchmarking paper [1]
[1] Hewamalage et al. (2021). Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1), 388-427.

Apply and evaluate all the pre-processing techniques discussed in Section 3 of the lecture:
● With or without STL decomposition
● Variance stabilization through transformation
● Trend / Mean normalization

Post-processing for final error metric computation:
1. Reverse the local normalization by adding the trend value of the last input point.
2. Reverse deseasonalization by adding back the seasonality components.
3. Reverse the log transformation by taking the exponential.
4. Subtract 1, if the data contain 0s.
5. For integer data, round the forecasts to the closest integer.
6. Clip all negative values at 0 (to allow only non-negative values in the forecasts).

Evaluate RNN models in different architectural and optimization setups:
[Figure not included in transcript]

Takeaways
[Figures not included in transcript]
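The six post-processing steps listed earlier in this section can be sketched as a single function. This is an illustrative sketch under stated assumptions, not the code of the benchmarking paper [1]: the function name, its arguments (`last_trend`, `seasonal`, `had_zeros`, `integer_data`) and the toy series are all hypothetical, and it assumes the forward pipeline was log(y + 1), then subtraction of the seasonal component, then subtraction of the last input trend value.

```python
import numpy as np

def postprocess_forecast(pred, last_trend, seasonal,
                         had_zeros=False, integer_data=False):
    """Map a model output back to the original scale of the series."""
    out = np.asarray(pred, dtype=float) + last_trend   # 1. undo trend normalization
    out = out + np.asarray(seasonal, dtype=float)      # 2. add seasonality back
    out = np.exp(out)                                  # 3. undo the log transform
    if had_zeros:
        out = out - 1.0                                # 4. undo the +1 offset
    if integer_data:
        out = np.rint(out)                             # 5. round to nearest integer
    return np.clip(out, 0.0, None)                     # 6. no negative forecasts

# Round trip on a toy series (hypothetical values): apply the assumed
# forward transforms, then check that post-processing inverts them.
y = np.array([0.0, 3.0, 8.0])
seasonal = np.array([0.1, -0.1, 0.0])
last_trend = 1.0
normalized = np.log(y + 1.0) - seasonal - last_trend

restored = postprocess_forecast(normalized, last_trend, seasonal,
                                had_zeros=True, integer_data=True)
print(restored)  # values equal the original series 0, 3, 8
```

The order matters: the steps undo the pre-processing in reverse, so the additive corrections (trend, seasonality) are applied in log space before exponentiating.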
Takeaways: Analysis of Seasonality Modeling
[Figure not included in transcript]

A Transformer-based Model

AutoFormer [2]
● Designed for long-term predictions and large input windows
● Has a built-in Series Decomposition Block
● Replaces standard self-attention with auto-correlation
[2] Wu et al. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. NeurIPS

AutoFormer [2] – Architecture
[Figure not included in transcript]

AutoFormer [2] – AutoCorrelation Block
[Figures not included in transcript]

AutoFormer [2] – Results
[Figures not included in transcript]
Note: the Exchange Dataset is without obvious periodicity.

Temporal 2D-Variation Modeling

TimesNet [3]
● Decompose and analyze both seasonal and trend-cycle components
[3] Wu et al. (2023).
TimesNet: Temporal 2D-variation modeling for general time series analysis. ICLR 2023
● TimesBlock: convert the 1D time series to a 2D space to simultaneously model intra- and inter-period variations
  [Figures not included in transcript]
● Overall Architecture
  [Figure not included in transcript]

TimesNet [3] – Evaluation
[Figure not included in transcript]

5) Approaches for Classification and Anomaly Detection

Time Series Representation Learning

TS2VEC [4]: learn a latent representation of a time series using contrastive learning

Different ways to force consistency
● Subseries consistency → the representation of a time series is closer to that of its sampled subseries.
● Temporal consistency → enforce local smoothness of representations by choosing adjacent segments as positive samples.
● Transformation consistency → augment the input series by different transformations (e.g. scaling, permutation); helps the model learn transformation-invariant representations.
[4] Yue et al. (2022).
TS2Vec: Towards universal representation of time series. AAAI 2022

Different ways to force consistency (continued)
● Contextual Consistency (TS2VEC) → treat the representations at the same timestamp in two augmented contexts as positive pairs
  – A context is generated by applying timestamp masking and random cropping on the input time series
● Hierarchical Contrasting → learn representations at various scales

Contextual representations are learned through:
● Temporal contrastive losses
● Instance-wise contrastive losses

Evaluation for Classification
[Figure not included in transcript]

Evaluation for Anomaly Detection
● Performed in a streaming protocol manner → given an input slice x1, x2, …, xt, label whether xt is an anomaly
Approach:
● Define the anomaly score as the dissimilarity in representation between a masked and an unmasked input

The End
Tutorial Time