DAT320 Interpolation - Norwegian University of Life Sciences - Autumn 2024 PDF
Norwegian University of Life Sciences
2024
Hans Ekkehard Plesser
Summary
This document covers missing values, imputation, and interpolation methods in the context of time series data analysis. The lecture notes are from Autumn 2024, and cover various aspects and considerations, including techniques for dealing with gaps in data, upsampling, downsampling, and global vs. local replacement.
Full Transcript
DAT320: Basics
Preprocessing: Missing Values, Imputation and Interpolation
Hans Ekkehard Plesser
[email protected]
Autumn 2024
Norwegian University of Life Sciences

Missing values in time series
Global & local replacement
Interpolation

Missing values in time series

Missing values—different ways of missing
▶ Many statistical and machine learning models cannot handle missing values
▶ First studied systematically by Rubin (1976)
▶ 3 types of missing values
  ▶ MCAR: Missing completely at random. Completely unrelated to the process of interest, e.g., data transmission error between thermometer and database
  ▶ MAR: Missing at random. Probability of a miss is governed by another process, e.g., thermometer fails more often in high humidity: conditioned on humidity, missingness is entirely random
  ▶ MNAR: Missing not at random. Probability of a miss is related to the process studied, e.g., miss more likely after prolonged cold spell
▶ See also Dong and Peng (2013), Mack et al. (2018), and Wikipedia

Missing values in time series
▶ Consecutive missing values (sub-periods of missing values)
▶ Single missing values
▶ R package for handling missing values: imputeTS
▶ For a quick overview, see the imputeTS cheat sheet

Upsampling & downsampling
▶ Upsampling (increasing resolution) requires replacement of missing values by interpolation
▶ Downsampling (decreasing resolution) may also require interpolation if a non-integer factor is used
(a) Upsampling (factor 2): x1 x2 x3 becomes x1 ? x2 ? x3 ?
(b) Downsampling (factor 1.5): x1 x2 x3 x4 x5 x6 becomes x1 ? x4 ?
Figure 1: Missing values (indicated with ?)
occur when performing upsampling, or downsampling with a non-integer factor

Handling missing values
▶ Option 1: Remove missing values
  ▶ usually not possible for time series, since a regular time axis is required
▶ Option 2: Replace missing values
  ▶ replace by fixed value (constant) or global distribution mean / median / etc.
  ▶ interpolate by non-missing neighbors (carry-forward, carry-backward)
  ▶ replace by rolling mean / median / weighted mean
  ▶ linear interpolation
  ▶ spline interpolation
  ▶ interpolation using forecasting models (→ Section Forecasting)
▶ See Moritz and Bartz-Beielstein (2017) for more information
▶ See interpolation.Rmd for code for the following examples

Global & local replacement

Global missing value replacement
▶ Default values (often 0 or 1) may be used for replacement
▶ Global mean / median might be computed from the data
▶ Can use random data
▶ Problem: time dynamics are not taken into account, e.g., trend or seasonality
Figure: Imputation by (a) global mean, (b) global median, (c) random values. Each panel shows Value over Time and the Density over Value for original and imputed data.

Local missing value replacement—Carry
▶ Last observation carried forward (LOCF): if $x_t$ is missing, replace by $x_t = x_s$, where $s = \max\{i : i < t,\ x_i \text{ not missing}\}$
▶ Next observation carried backward (NOCB): if $x_t$ is missing, replace by $x_t = x_s$, where $s = \min\{i : i > t,\ x_i \text{ not missing}\}$
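The two carry rules can be sketched in a few lines. This is a minimal illustration in plain Python, not the imputeTS implementation the slides refer to; the function names `locf` and `nocb` and the use of `None` as the missing-value marker are assumptions made for this example.

```python
def locf(x):
    """Last observation carried forward; None marks a missing value."""
    out, last = [], None
    for v in x:
        if v is not None:
            last = v          # remember the most recent observed value
        out.append(last)      # stays None while no earlier observation exists
    return out


def nocb(x):
    """Next observation carried backward: LOCF applied to the reversed series."""
    return locf(x[::-1])[::-1]


series = [None, 3.0, None, None, 7.0, None]
print(locf(series))  # [None, 3.0, 3.0, 3.0, 7.0, 7.0]
print(nocb(series))  # [3.0, 3.0, 7.0, 7.0, 7.0, None]
```

Note that LOCF cannot fill a gap at the start of the series and NOCB cannot fill one at the end, since no observation exists on the required side.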
Figure: (a) LOCF, (b) NOCB

Local missing value replacement—Carry
Figure: Imputation by (a) last observation carried forward, (b) next observation carried backward. Each panel shows Value over Time and the Density over Value for original and imputed data.

Local missing value replacement—Rolling average
▶ Rolling average replacement (moving average)
  ▶ replace missing value by mean ($\mu$) over local neighbors
  ▶ if $x_t$ is missing, replace $x_t = \mu(x_{t-\ell}, \dots, x_{t+\ell})$, $\ell = k/2$
▶ Other options for missing value replacement via rolling statistics:
  ▶ median
  ▶ weighted average: $\mu_w(x) = w^T x$, where $w \in \mathbb{R}^{2\ell}$, $\|w\|_1 = 1$, and $x = (x_{t-\ell}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+\ell})$ (→ does not contain $x_t$)
Figure 5: Missing value replacement by rolling average ($\ell = 2$).

Local missing value replacement—Rolling average
Linearly weighted moving average
  $w_i = \begin{cases} \dfrac{i}{\ell(\ell+1)} & i \le \ell \\ \dfrac{2\ell - i + 1}{\ell(\ell+1)} & i > \ell \end{cases}$ for $i = 1, \dots, 2\ell$
Exponentially weighted moving average
  $w_i = \begin{cases} C \cdot (1-\alpha)^{\ell - i} & i \le \ell \\ C \cdot (1-\alpha)^{i - \ell - 1} & i > \ell \end{cases}$ for $i = 1, \dots, 2\ell$, $C = \dfrac{\alpha}{2 - 2(1-\alpha)^{\ell}}$, $\alpha \in [0, 1]$
Figure: Filter weights W over t for the exponential, flat, and linear filters.

Local missing value replacement—Rolling average
Figure: Imputation by (a) plain rolling average ($\ell = 2$), (b) linear weighted average ($\ell = 2$), (c) exponentially weighted average ($\ell = 2$, $\alpha = 0.5$). Each panel shows Value over Time and the Density over Value for original and imputed data.

Local missing value replacement
Function na_ma for rolling average imputation
▶ Parameter k in package imputeTS denotes the number of time steps on one side, i.e., the window size is 2k
▶ “If all observations in the current window are NA, the window size is automatically increased until there are at least 2 non-NA values present.” (imputeTS documentation)
▶ The default weighting is "exponential"
▶ maxgap parameter can be used to prevent imputation for long gaps (applies to all na_* functions)
▶ Never use a function before you have read its documentation!

Interpolation

Linear interpolation
▶ Missing values are interpolated linearly between previous and next non-missing values
▶ If $x_t$ is missing, replace by
  $x_t = \dfrac{s_2 - t}{s_2 - s_1}\, x_{s_1} + \dfrac{t - s_1}{s_2 - s_1}\, x_{s_2}$,
  where $s_1 = \max\{i : i < t,\ x_i \text{ not missing}\}$ and $s_2 = \min\{i : i > t,\ x_i \text{ not missing}\}$
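The rolling-average weights and the linear interpolation rule can be checked numerically with a short sketch. It is written in plain Python rather than with imputeTS, and all function names are hypothetical. It assumes that `None` marks a missing value, that every neighbor in the rolling window is observed, and that $0 < \alpha \le 1$ so that the normalisation constant C is defined.

```python
def linear_weights(l):
    """Linearly weighted filter w_1..w_2l; the weights sum to 1."""
    norm = l * (l + 1)
    return [(i if i <= l else 2 * l - i + 1) / norm for i in range(1, 2 * l + 1)]


def exponential_weights(l, alpha):
    """Exponentially weighted filter w_1..w_2l, assuming 0 < alpha <= 1."""
    c = alpha / (2 - 2 * (1 - alpha) ** l)
    return [c * (1 - alpha) ** (l - i if i <= l else i - l - 1)
            for i in range(1, 2 * l + 1)]


def weighted_fill(x, t, weights):
    """Value for missing x[t] as w^T x over the l neighbors on each side.

    Assumes the window lies inside the series and has no other missing values.
    """
    l = len(weights) // 2
    neighbors = x[t - l:t] + x[t + 1:t + l + 1]   # excludes x[t] itself
    return sum(w * v for w, v in zip(weights, neighbors))


def interpolate_linear(x):
    """Fill interior gaps (None) linearly between the nearest observed values."""
    out = list(x)
    known = [i for i, v in enumerate(x) if v is not None]
    for s1, s2 in zip(known, known[1:]):          # consecutive observed indices
        for t in range(s1 + 1, s2):               # missing positions in between
            out[t] = ((s2 - t) * x[s1] + (t - s1) * x[s2]) / (s2 - s1)
    return out


print(interpolate_linear([10.0, 20.0, None, None, None, 60.0, 70.0]))
# [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
```

For $\ell = 2$ the linear weights are (1/6, 2/6, 2/6, 1/6), and the exponential weights with $\alpha = 0.5$ come out the same. Gaps at the start or end of the series are left untouched by `interpolate_linear`, because one of the two anchor points is missing.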
Figure 7: Linear interpolation of the gap in x1 x2 ? ? ? x6 x7, giving x3 = 3/4 x2 + 1/4 x6, x4 = 1/2 x2 + 1/2 x6, and x5 = 1/4 x2 + 3/4 x6

Linear interpolation
▶ Connecting 2 points is not always sufficient to capture the behavior of the time series
▶ Option 1: Increase degree of polynomial (intractable for high number of points)
▶ Option 2: Spline interpolation (piecewise polynomial functions)

Spline interpolation
▶ Spline interpolation requires
  ▶ a polynomial basis
  ▶ a set of x-y-coordinates (knots)
▶ Use polynomials of degree 3 (cubic)
▶ One polynomial $q(t)$ per sequence of missing values $(x_{s_1}, x_{s_2})$
▶ Fit parameters of polynomials, such that
  ▶ $q(s_1) = x_{s_1}$
  ▶ $q(s_2) = x_{s_2}$
  ▶ $q'(s_1) = D(x_{s_1})$
  ▶ $q'(s_2) = -D(x_{s_2+1})$
▶ For more about splines, see Wikipedia

Linear vs spline interpolation
Figure: Imputation by (a) linear interpolation, (b) spline interpolation. Each panel shows Value over Time and the Density over Value for original and imputed data.
▶ Check the scales! Take a careful look at the scales in the left and right plot!

Literature
Y. Dong and C.-Y. J. Peng. Principled missing data methods for researchers. SpringerPlus, 2(1):222, May 2013. ISSN 2193-1801. doi: 10.1186/2193-1801-2-222. URL https://doi.org/10.1186/2193-1801-2-222.
C. Mack, Z. Su, and D. Westreich. Types of missing data.
In Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide. Agency for Healthcare Research and Quality (US), Rockville, MD, third edition, 2018. URL https://www.ncbi.nlm.nih.gov/books/NBK493614/.
S. Moritz and T. Bartz-Beielstein. imputeTS: Time Series Missing Value Imputation in R. The R Journal, 9(1):207–218, 2017. doi: 10.32614/RJ-2017-009. URL https://doi.org/10.32614/RJ-2017-009.
D. B. Rubin. Inference and Missing Data. Biometrika, 63(3):581–592, 1976. ISSN 0006-3444. doi: 10.2307/2335739. URL https://www.jstor.org/stable/2335739.