BCI methods overview

**[LDA: Linear Discriminant Analysis]**

Idea: train a model on past experience that can predict the label of new, unseen data. The model is $f(x) = w^{T}x + b$, where $f(x) = 0$ gives the decision boundary H.

- Supervised learning
- Classification problem (categorical/binary class labels)
- Place in pipeline: machine learning or pre-processing
- Assumptions made:
  - Data from both classes follow a Gaussian distribution
  - Covariance matrices of the classes are equal

**Optimization problem (maximize Fisher criterion):**

$$J(w) = \operatorname{argmax}_{w}\frac{w^{T}S_{B}w}{w^{T}S_{W}w}$$

- $S_{B}$ = between-class covariance
- $S_{W}$ = total within-class covariance (average of the 2 classes)
- Maximize the between-class covariance and minimize the within-class covariance

$$w = S_{\text{avg}}^{-1}(m_{2} - m_{1})$$

$$b = -\frac{1}{2}w^{T}(m_{1} + m_{2})$$

**Pros:**

- The optimization problem can be computed analytically due to the assumptions made (see the code sketch after this section)
- Linear, so fast to train

**Cons:**

- Computing the total within-class covariance matrix can be tricky
- Inverting the matrix is computationally expensive and difficult
- Relies on its assumptions (Gaussian classes, equal covariances)
- Can only capture linearly separable data. However, this can be addressed by first applying a non-linear basis function $\phi$ to the data.
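A minimal sketch of the analytical solution above, assuming the two classes are given as NumPy arrays `X1` and `X2` of shape (n_trials, n_features); the helper name `fit_lda` is illustrative and not from the notes:

```python
import numpy as np

def fit_lda(X1, X2):
    """Fisher LDA for two classes under the equal-covariance assumption.

    X1, X2: arrays of shape (n_trials, n_features), one per class.
    Returns weight vector w and bias b such that f(x) = w @ x + b.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (average) within-class covariance S_avg
    S_avg = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
    # Closed-form solution: w = S_avg^{-1} (m2 - m1)
    w = np.linalg.solve(S_avg, m2 - m1)
    b = -0.5 * w @ (m1 + m2)
    return w, b

# Usage: the sign of f(x) = w @ x + b decides the class of a new sample x.
```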
**[PCA: Principal Component Analysis]**

Idea: find new basis vectors for your data. These can be used for dimensionality reduction.

- Unsupervised learning
- Place in pipeline: pre-processing (usually)
- Assumptions made:
  - The data is linearly correlated
  - Relevance is expressed by variance (high variance = more important data)
  - Data is continuous

**Steps:**

1. Translation: all data is shifted to the origin by subtracting the mean of all data points.
2. Rotation: the directions of maximal variance become the new axes. The eigenvector with the largest variance (highest eigenvalue) becomes the 1st dimension, the next one the 2nd, and so on.
3. Scaling: scale the data using the eigenvalues (variances) so that the variance in all directions is equal (normalization/whitening).
4. Projection: project each data point onto the first (few) eigenvectors to get a lower-dimensional representation of the data.

**Optimization problem (maximize variance):** find the direction $u_{1}$ that gives the most variance in the projected data:

$$\operatorname{argmax}_{u_{1}}\; u_{1}^{T}Su_{1}$$

- $S$ = covariance matrix of the data
- We add the constraint $u_{1}^{T}u_{1} = 1$ so that $u_{1}$ does not grow to infinity
- This constraint can be enforced using a Lagrange multiplier $\lambda_{1}$:

$$\operatorname{argmax}_{u_{1},\,\lambda_{1}}\;\left[u_{1}^{T}Su_{1} + \lambda_{1}\left(1 - u_{1}^{T}u_{1}\right)\right]$$

- Setting the derivative with respect to $\lambda_{1}$ to 0 recovers $u_{1}^{T}u_{1} = 1$, so the constraint is consistent
- Setting the derivative with respect to $u_{1}$ to 0 gives $Su_{1} = \lambda_{1}u_{1}$
- This means the variance is maximal if $u_{1}$ is an eigenvector of the covariance matrix $S$. In other words, the variance is maximized by setting $u_{1}$ equal to the eigenvector with the largest eigenvalue $\lambda_{1}$ (see the code sketch after this section).

**Pros:**

- Linear, so computationally effective and inexpensive
- Provides a basis for dimensionality reduction, which helps to reduce noise and computational costs
- Helps to visualize high-dimensional data in a lower-dimensional subspace

**Cons:**

- PCA assumes that relevance is expressed by variance
- PCA is linear, so it can only capture linear relationships between data features
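A compact sketch of the translation/rotation/scaling/projection steps above via an eigendecomposition of the covariance matrix, assuming a NumPy array `X` of shape (n_samples, n_features); the function name `pca_project` is illustrative:

```python
import numpy as np

def pca_project(X, n_components=2, whiten=False):
    """Project X onto its top principal components.

    X: array of shape (n_samples, n_features).
    Returns the projected data and the chosen eigenvectors.
    """
    Xc = X - X.mean(axis=0)                  # 1) translation: center the data
    S = np.cov(Xc, rowvar=False)             # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)     # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :n_components]            # 2) rotation: new axes
    Z = Xc @ U                               # 4) projection onto the first components
    if whiten:                               # 3) optional scaling / whitening
        Z = Z / np.sqrt(eigvals[:n_components])
    return Z, U
```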
**[ICA: Independent Component Analysis]**

Idea: when recording a mixture of multiple fixed data sources, separate the recordings back into the individual source components. In BCI this is useful for separating noise sources (eye blinks, muscle activity) from useful neural sources.

- Unsupervised learning
- Place in pipeline: pre-processing (usually)
- Assumptions made:
  - The number of sources equals the number of sensors N
  - The sources are mixed linearly into the recordings
  - At most one source may be Gaussian-distributed (i.e. at least N-1 sources are non-Gaussian)
  - Sources are statistically independent at each time point t

**Central limit theorem:** combined data (mixed sources) has a more Gaussian distribution than the individual sources.

- The independent sources will therefore be the least Gaussian.

**Model:**

$$x = As$$

- $x$ = vector of observed data
- $A$ = mixing matrix containing all individual mixture coefficients
- $s$ = vector of all sources

If we could estimate $A$, its inverse $W$ would tell us how to recover the original sources:

$$s = Wx$$

**Estimation of W:**

1. Optimize a filter vector $w$ (one row of $W$), which projects the data $x$ onto an estimated source $\hat{s}$:

$$\hat{s} = w^{T}x$$

2. To optimize $w$, loop until convergence:
   I. Initialize the filter weights in vector $w$
   II. Determine the direction in which the kurtosis of $\hat{s}$ (excess kurtosis = 0 for a Gaussian) grows most strongly, or decreases most strongly
   III. Run gradient descent to improve $w$
3. Project out the estimated source $\hat{s}$ and repeat step 2 to obtain all weight vectors, yielding the full matrix $W$ from which the vector of all estimated sources $s$ follows.

**Pros:**

- Linear method, so fast
- Many variants exist that use different definitions of 'statistical independence', yielding different results
- Separates mixtures into individual components relatively well, which is very useful for e.g. BCI applications

**Cons:**

- ICA is underdetermined, as we only know $x$; multiple runs of ICA might yield different results. These can be compared on similarity measures such as cosine similarity in vector space
- The number of sources might not equal the number of sensors
- Assumes statistical independence
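A short usage sketch with scikit-learn's FastICA (one common ICA variant), assuming the recording is a NumPy array `X` of shape (n_samples, n_channels); the artifact-removal step and the chosen component index are illustrative only:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Observed mixtures X: shape (n_samples, n_channels); random data as a stand-in here.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))

ica = FastICA(n_components=8, random_state=0)
S_hat = ica.fit_transform(X)      # estimated sources, one column per component
A_hat = ica.mixing_               # estimated mixing matrix A  (x ~ A s)
W_hat = ica.components_           # estimated unmixing matrix W (s ~ W x)

# Artifact-removal sketch: zero out one component (index 0 is only an example,
# e.g. a component judged to reflect eye blinks) and map back to channel space.
S_clean = S_hat.copy()
S_clean[:, 0] = 0.0
X_clean = ica.inverse_transform(S_clean)
```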
**[Logistic Regression]**

Idea: predict the future based on past experience. The output is a probability giving the likelihood that y = 1.

- Supervised learning
- Classification problem (categorical/binary)
- Place in pipeline: main machine learning part

**Model:**

$$h_{w}(x) = \frac{1}{1 + e^{-w^{T}x}}$$

- Linear regression combined with a sigmoid function $g(z)$ to yield a probability value
- The use of $g(z)$ is similar to post-processing of a linear method like LDA
- Given a data point $x$, $h_{w}(x)$ is the probability that y = 1

**How to obtain weights w:** define a loss function and find its global minimum through gradient descent. A convex loss function is preferred over a non-convex one.

- Quadratic loss is not suitable for logistic regression: the interaction between the quadratic term and the sigmoid function creates wrinkles in the loss surface, leading to a non-convex shape.

Adapted loss function for logistic regression:

$$J\left(h_{w}(x),\,y\right) = \begin{cases} -\log\left(h_{w}(x)\right) & \text{for } y = 1 \\ -\log\left(1 - h_{w}(x)\right) & \text{for } y = 0 \end{cases}$$

- False negative decisions should be penalized much more than false positives (think about tumour classification)

**Pros:**

- Probabilistic output, which allows for good interpretation as a basis for real-life decisions
- Can be regularized by adding a penalty term (e.g. L1), which penalizes large weights and makes the model less prone to overfitting
- Fast to train

**Cons:**

- Requires a large sample size to obtain stable estimates
- Non-linear problems can't be solved with plain logistic regression (it can be extended with a kernel)

**[CSP: Common Spatial Patterns]**

Idea: extract oscillatory features from a multi-channel recording. CSP decomposes the original channel space linearly into subspace components.

- Supervised learning
- Place in pipeline: feature extraction/encoding
- Assumptions made:
  - Sources are independent and non-Gaussian

There is high variability in observed ERD/ERS features between subjects, sessions and trials. Solutions include longer breaks between trials, a familiarisation phase and/or filtering out background signals (ICA).

**CSP steps:**

1. Bandpass + spatial filtering
2. For every CSP channel, extract the average log bandpower of either:
   - a fixed time window relative to a triggered trial start (synchronous mode), or
   - short sliding windows during a self-paced application (asynchronous mode)
3. Concatenate the log bandpower features of the selected CSP channels
4. Use a pre-trained classification model (LDA, logistic regression) to obtain class decisions
5. The user can now issue control commands

**Transformation steps by CSP:**

1. Whitening with respect to $(S_{1} + S_{2})$
2. Rotation of the eigenvectors to form the new CSP coordinate axes

**Standard forward model:**

$$x = As$$

Task: learn one or multiple optimized spatial unmixing filters $w$ (like ICA). We want to select the components with the most extreme eigenvalues, ideally at both ends of the spectrum.

Find a filter matrix $W$ and a diagonal matrix $D$ with ordered entries $0 \leq \lambda_{j} \leq 1$ such that:

$$W^{T}S_{1}W = D \quad \text{with} \quad W^{T}(S_{1} + S_{2})W = I$$

The analytical solution is provided by solving the generalized eigenvalue problem (see the code sketch following this section):

$$S_{1}w = \lambda\,(S_{1} + S_{2})\,w$$

**Pros:**

- Fast to train
- The early spatial filtering step reduces dimensionality in online applications and speeds up the rest of the pipeline
- Can be regularized

**Cons:**

- CSP is sensitive to outliers (remove outliers or noisy EEG channels before training)
- Supervised, so it can overfit to the training data
- Hyperparameters need to be determined

**Regularized CSP:** to avoid overfitting, CSP can be regularized:

$$\operatorname{argmax}_{w}\frac{w^{T}\Sigma_{1}w}{w^{T}\Sigma_{2}w + \alpha P(w)}$$

- $\Sigma_{i}$ = covariance matrix of class $i$
- The penalty function $P$ measures how well the spatial filters satisfy a given prior (L1 or L2 norm)
- $\alpha$ = user-defined regularization parameter, giving the strength of the regularization

**Filter Bank CSP:**

1. Filter bank: pass the EEG data through several frequency filters
2. Spatial filtering: one CSP per frequency band
3. Feature selection: pool the features from all frequency bands (e.g. 64 per band), or take only the best few of each band, and from these select only the few best ones (ranked on some criterion)
4. Classification: naïve Bayes, decision tree, k-nearest neighbour, support vector machine, etc.

-> FBCSP outperforms plain CSP and is state of the art in BCI competitions
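A minimal sketch of the CSP training and feature-extraction steps referenced above, assuming bandpass-filtered trials stored as NumPy arrays of shape (n_trials, n_channels, n_samples); the helper names and the trace-normalized covariance estimate are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from scipy.linalg import eigh

def train_csp(trials_1, trials_2, n_filters=6):
    """Compute CSP spatial filters from two classes of bandpassed EEG trials.

    trials_i: array of shape (n_trials, n_channels, n_samples).
    Returns W with one spatial filter per column, taken from both ends
    of the eigenvalue spectrum.
    """
    def class_cov(trials):
        # Average trace-normalized spatial covariance over trials
        covs = [t @ t.T / np.trace(t @ t.T) for t in trials]
        return np.mean(covs, axis=0)

    S1, S2 = class_cov(trials_1), class_cov(trials_2)
    # Generalized eigenvalue problem: S1 w = lambda (S1 + S2) w
    eigvals, eigvecs = eigh(S1, S1 + S2)
    order = np.argsort(eigvals)
    picks = np.concatenate([order[: n_filters // 2], order[-(n_filters // 2):]])
    return eigvecs[:, picks]

def log_bandpower_features(trials, W):
    """Project trials with CSP filters W and compute log-variance features."""
    projected = np.einsum("ck,tcs->tks", W, trials)  # (n_trials, n_filters, n_samples)
    var = projected.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))
```

The resulting log-bandpower features can then be passed to a classifier such as LDA or logistic regression, as in step 4 of the CSP pipeline above.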
**[Linear Regression]**

Idea: learn a function f(x) from past experience that predicts the corresponding label y as well as possible.

- Supervised learning
- Regression problem (with continuous labels instead of binary class labels)
- Place in pipeline: main machine learning method
- Data point x = independent/explanatory variable
- Label y = dependent/response variable
- $X$ = N x D matrix containing feature values with D input dimensions
- $y$ = N x 1 vector containing the continuous labels
- Linear regression versions:
  1. Simple linear regression: both X and y are 1-dimensional
  2. Multivariable/multiple regression: for X, D > 1; for y, D = 1
  3. Multivariate regression: for X, D = 1; for y, D > 1
  4. Multivariate multiple linear regression: for X, D > 1; for y, D > 1
- Assumptions made:
  - Linear relationship between x and y
  - Residuals must be independent
  - Residuals have constant variance at every level of x
  - Expected value of the residuals is 0
  - Data follows a Gaussian distribution

**Simplest model:**

$$f(\mathbf{x}, \mathbf{w}) = w_{0} + \sum_{j=1}^{D} w_{j}x_{j}$$

- $w_{0}$ = bias
- $\mathbf{x}$ = data point, a vector containing the input features
- $\mathbf{w}$ = weight vector, containing the weight for each feature

**Linear regression with basis functions:** an enhancement of the simple model that works on non-linear data as well.

$$f(\mathbf{x}, \mathbf{w}) = w_{0} + \sum_{j=1}^{M-1} w_{j}\phi_{j}(\mathbf{x})$$

- $\phi_{j}$ = basis function
- The model is still a linear function of the weights, but now a non-linear function of the input vector $\mathbf{x}$
- Adding basis functions may enlarge the dimensionality

It is convenient to add a dummy basis function $\phi_{0}(x) = 1$ to absorb the bias:

$$f(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_{j}\phi_{j}(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x})$$

- This is an augmented notation, since the bias is now included in the vectors; the dimensionality has increased by 1.

**Sensitivity:** assuming we keep all other variables fixed, how much would the estimated label $\hat{y}$ change if the value of $x_{1}$ were increased/decreased by 1?

Add error residuals to the model to assess model performance:

$$f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{T}\mathbf{x} + \varepsilon$$

- $\varepsilon$ = vector containing all residuals

**How to derive the weights w?** Assuming the augmented vector notation of the model:

$$y = Xw + \varepsilon$$

We would like to minimize the squared error (L2) of the model:

$$\operatorname{argmin}_{w}\|\varepsilon\|^{2} = \operatorname{argmin}_{w}\|y - Xw\|^{2} = \operatorname{argmin}_{w}\|y - \hat{y}\|^{2}$$

Setting the derivatives with respect to w to 0 to find the minimum of the loss function, we obtain:

$$w = \left(X^{T}X\right)^{-1}X^{T}y$$

- This is the value of w that minimizes the L2 loss, so it gives the best predictions (see the code sketch after this section)
- $\left(X^{T}X\right)^{-1}$ = inverse of the covariance matrix of X

An alternative loss is L1, which is less sensitive to outliers.

**Regularization of linear regression:**

- Ridge regression: quadratic loss on the residuals with an L2-norm penalty on the weights. Analytical solution for w; strong influence of outliers
- Lasso: quadratic loss on the residuals with an L1-norm penalty on the weights. No analytical solution, but sparse in w; reduces the influence of outliers

**Pros:**

- Efficient and fast
- Easy to implement and interpret

**Cons:**

- Relies on its assumptions
- Sensitive to outliers
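A short sketch of the normal-equation solution referenced above, assuming a NumPy design matrix `X` of shape (N, D) and labels `y` of shape (N,); the bias is absorbed via the augmented notation by prepending a column of ones, and the helper names are illustrative:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equation w = (X^T X)^{-1} X^T y.

    X: (N, D) design matrix, y: (N,) continuous labels.
    Returns the augmented weight vector (bias first).
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # dummy basis phi_0(x) = 1
    # Solve (X^T X) w = X^T y instead of forming the explicit inverse
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

def predict(X, w):
    """Predict y-hat = w^T phi(x) for each row of X."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return X_aug @ w
```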
**[SPoC: Source Power Comodulation]**

Idea: extract oscillatory features from a multi-channel recording. SPoC decomposes the original channel space linearly into subspace components and represents each oscillatory subspace by a spatial filter.

- Supervised learning
- Place in pipeline: feature extraction/encoding
- Typically used to extract neural features that correlate either with a behavioural measurement (e.g. reaction time) or with an external stimulus parameter (e.g. intensity)
- Assumptions made:
  - Labels are continuous

**Versions of SPoC:**

1. **SPoC~r2~:** maximize the correlation between the bandpower of the estimated source and the known continuous labels -> leads to a non-convex optimization problem
2. **SPoC~λ~:** maximize the covariance between the bandpower of the estimated source and the known continuous labels -> generalized eigenvalue problem

**Optimization problem for SPoC~r2~:** find a spatial filter w such that the resulting source's bandpower is maximally correlated with the target variable z:

$$J(w) = \operatorname{argmax}_{w}\frac{\operatorname{Cov}\!\left(w^{T}\left(C(e) - \overline{C}\right)w,\; z(e)\right)^{2}}{\operatorname{Var}\!\left(w^{T}\left(C(e) - \overline{C}\right)w\right)\operatorname{Var}\!\left(z(e)\right)}$$

- $e$ = epochs
- $z(e)$ = continuous epoch-wise labels
- $C(e)$ = epoch-wise covariance matrices
- $\overline{C}$ = average covariance matrix over all epochs

Once the spatial filter w is determined, it derives a single virtual SPoC channel from the multi-channel data X by:

$$\text{SPoC channel} = w^{T}X$$

**Optimization problem for SPoC~λ~:** find a spatial filter w such that the covariance between the source's bandpower and the target variable z is maximal:

$$J(w) = \operatorname{argmax}_{w}\frac{\operatorname{Cov}\!\left(w^{T}\left(C(e) - \overline{C}\right)w,\; z(e)\right)}{w^{T}\overline{C}w} = \operatorname{argmax}_{w}\frac{w^{T}\left\langle C(e)\,z(e)\right\rangle w}{w^{T}\overline{C}w}$$

- Solving this generalized eigenvalue problem delivers the filter matrix W (see the sketch after this section)
- Use the filters with the strongest eigenvalues to derive informative SPoC components
- Very similar to CSP

**Pros:**

- SPoC~λ~ is very fast to train (analytical solution!)
- SPoC~r2~ can deliver slightly better results and can be initialized using a filter derived by SPoC~λ~
- The early spatial filtering step reduces dimensionality

**Cons:**

- Sensitive to outliers
- SPoC can overfit
- Hyperparameters need to be determined

**Regularized SPoC:**

$$\operatorname{argmax}_{w}\frac{w^{T}\Sigma_{z}w}{(1 - \alpha)\,w^{T}\Sigma_{\text{avg}}w + \alpha P(w)}$$

- Penalty function P with strength α
- Convex trade-off between $w^{T}\Sigma_{\text{avg}}w$ and $P(w)$ depending on α
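A sketch of the SPoC~λ~ generalized eigenvalue problem referenced above, assuming epoched, bandpass-filtered data of shape (n_epochs, n_channels, n_samples) and continuous labels `z`; standardizing `z` and the per-epoch covariance estimate are assumptions of this sketch, and `train_spoc_lambda` is an illustrative name:

```python
import numpy as np
from scipy.linalg import eigh

def train_spoc_lambda(epochs, z):
    """SPoC_lambda: spatial filters whose source bandpower covaries with z.

    epochs: (n_epochs, n_channels, n_samples) bandpass-filtered data.
    z:      (n_epochs,) continuous labels.
    Returns filters W (columns ordered by |eigenvalue|, strongest first)
    and the corresponding eigenvalues.
    """
    # Epoch-wise spatial covariance matrices C(e)
    C = np.array([e @ e.T / e.shape[1] for e in epochs])
    C_avg = C.mean(axis=0)                             # average covariance over epochs
    z = (z - z.mean()) / z.std()                       # standardize the labels
    Sigma_z = np.einsum("e,ecd->cd", z, C) / len(z)    # <C(e) z(e)>
    # Generalized eigenvalue problem: Sigma_z w = lambda * C_avg w
    eigvals, eigvecs = eigh(Sigma_z, C_avg)
    order = np.argsort(np.abs(eigvals))[::-1]          # strongest comodulation first
    return eigvecs[:, order], eigvals[order]

# The first filter w = W[:, 0] yields the virtual SPoC channel w^T X whose
# bandpower (variance per epoch) comodulates most strongly with z.
```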