Advancing the detection of steady-state visual evoked potentials in brain–computer interfaces

Objective. Spatial filtering has proved to be a powerful pre-processing step in detection of steady-state visual evoked potentials and boosted typical detection rates both in offline analysis and online SSVEP-based brain–computer interface applications. State-of-the-art detection methods and the spatial filters used thereby share many common foundations as they all build upon the second order statistics of the acquired Electroencephalographic (EEG) data, that is, its spatial autocovariance and cross-covariance with what is assumed to be a pure SSVEP response. The present study aims at highlighting the similarities and differences between these methods. Approach. We consider the canonical correlation analysis (CCA) method as a basis for the theoretical and empirical (with real EEG data) analysis of the state-of-the-art detection methods and the spatial filters used thereby. We build upon the findings of this analysis and prior research and propose a new detection method (CVARS) that combines the power of the canonical variates and that of the autoregressive spectral analysis in estimating the signal and noise power levels. Main results. We found that the multivariate synchronization index method and the maximum contrast combination method are variations of the CCA method. All three methods were found to provide relatively unreliable detections in low signal-to-noise ratio (SNR) regimes. CVARS and the minimum energy combination methods were found to provide better estimates for different SNR levels. Significance. Our theoretical and empirical results demonstrate that the proposed CVARS method outperforms other state-of-the-art detection methods when used in an unsupervised fashion. Furthermore, when used in a supervised fashion, a linear classifier learned from a short training session is able to estimate the hidden user intention, including the idle state (when the user is not attending to any stimulus), rapidly, accurately and reliably.


Introduction and motivation
Steady-state visual evoked potentials (SSVEP) refer to the involuntary brain response to repetitive visual stimulation of the eye in humans and some non-human primates [1]. The measured scalp Electroencephalographic (EEG) potentials are characterized in the frequency domain by constant amplitude and phase at the Fourier components of the stimulation frequency and its higher harmonics. SSVEPs can be observed when the driving frequency of the stimulus is in the range 4-100 Hz [2], below which the electrical excitations of the visual system is able to abate before the new stimuli are presented, and in this case, it is the transient VEPs that can be observed [3]. Capilla et al [4] showed that steady-state VEPs can be accurately predicted from the linear summation of appropriately constructed transient responses and concluded that both can be attributed to the same underlying neural mechanism.
Due to several factors, SSVEP has been a core concept in non-invasive EEG-based brain-computer interface (BCI) applications. Among these are its robustness and relatively high signal-to-noise ratio (SNR), the high information transfer rate it delivers and the short training time required, if any, before it can be used for online applications. Typically, in SSVEP-based BCIs, different frequency-tagged stimuli are displayed simultaneously with each stimulus given a predefined mapping to a system command. This mapping is also known to the user, who can control or communicate with the system, simply by attending to the stimulus corresponding to the command of interest. Relatively easy discrimination between the different frequencies is facilitated by the fact that selective attention to stimulus location modulates SSVEP [5].
Early investigated methods for SSVEP analysis and detection have relied mainly on power spectral density analysis with fast Fourier transform (FFT) applied on singlechannel EEG data. Herewith, the temporal EEG signal is transformed into its Fourier representations where test statistics can be derived from obtained information about the power (or amplitude) [3,[6][7][8][9], phase [10] or both [11], at all considered driving frequencies. FFT-based methods require relatively long data segments to give reasonable detection results since the frequency resolution (Df ) in the Fourier domain is determined by the reciprocal of the temporal data length available (e.g. it is required to have a 4s data segment to get a frequency resolution of D = f 0.25 Hz). Known issues with FFT like the grid effect (Fourier components cannot be computed for frequencies that are not an integral multiple of Df ) and spectral leakage (energy spillover from one frequency bin to adjacent ones due to rectangular windowing) highly affect the calculated amplitude and phase precision and should be accounted for by choosing suitable window functions and segment lengths [5,12,13]. In [14], authors argue that for an arbitrary steady-state recording, there might not be enough information to benefit from phase information in SSVEP detection.
Recently, there has been a great tendency towards different methods that rely on spatial filtering of multi-channel EEG data, which proved to be more efficient and stable than FFT-based methods. The basic idea here is to find a spatial filter that transforms the original multi-channel EEG signals into single or multi-channels with desirable characteristics. Friman et al [15] proposed the minimum energy combination (MEC) spatial filter, the first of its kind, which aims at minimizing the noise energy, with the noise defined as the orthogonal complement to the projection of the EEG signals onto the subspace spanned by the pure SSVEP vectors. Alternatively, the canonical correlation analysis (CCA) aims at finding a pair of spatial linear combinations for both, the EEG signals and the assumed pure SSVEP responses, which jointly maximize the correlation between the resulting canonical variates [16]. The maximum contrast combination (MCC) filter maximizes the SNR, defined with the generalized Rayleigh quotient [15,17]. Finally, the multivariate synchronization index (MSI), through spatial whitening, extracts a single metric that reflects the synchronization level between the EEG signals and assumed pure SSVEPs [18].
The main contribution of this work is that we show the similarity of the spatial filters and the scoring functions used by the standard unsupervised CCA, MSI and MCC approaches to SSVEP detection, where we conclude also that their detection accuracies should not differ significantly. We show additionally that these methods, in low SNR regimes, fail to provide a reliable detection as they ignore the noise power in the spontaneous EEG that overlaps the stimulation frequencies.
On the other hand, parametric spectral analysis, like autoregressive methods, are able to provide an estimate for the noise power levels after removing the total energy of the driving frequencies from the spatially filtered signals. This was shown to provide reliable test statistics for SNR that can be used for SSVEP detection [15,19]. Based on the theoretical analysis of these state-of-the-art methods, we propose a new method, namely the canonical variates with autoregressive spectral analysis (CVARS) that estimates the signal and noise power levels from the canonical variates which leads to slight improvement in correct detection rates. This method, when used in a supervised fashion, results in a rapid, accurate and reliable SSVEP detection regardless of the level of SNR as shown from the experimental results with real EEG data. We additionally show the conditions on which the MEC-based detection provides similar results to the CVARS method.

Notation
We use bold uppercase letters (e.g. Z) to denote twodimensional matrices and bold lowercase letters (e.g. z) to denote column vectors, and conversely z T for row vectors. The Frobenius norm of the matrix Z is denoted by   Z F . A p-variate random variable is denoted by an uppercase letter, where it refers to a vector of p random variables, e.g. [ . Realizations of Z are denoted the same way as vectors (e.g. z). A sequence of n p-dimensional vectors is concatenated in the sample data matrix The autocovariance matrix of p-variate random variable X is denoted by C xx and the cross-covariance matrix between two random vectors X and Y is denoted by C xy . Often in practice, the true value of such covariances remains unknown, and therefore they are replaced by the sample auto and crosscovariance matrices Q xx and Q xy , computed from the centered sample matrices with = Q X X , where N is the number of available samples in X and Y. Unless explicitly stated otherwise, all sample matrices encountered here will be centered or made centered by subtracting the sample mean from all observations.

Background
We assume that after the retina is excited by flickering light, and after the transient VEPs vanish [20], pure SSVEP responses appear as multiple phase-shifted sinusoidal waves whose frequencies are integer multiple (up to N h ) of the driving frequency. The assumed pure waves propagate to the scalp where EEG signals are measured. Due to volume conduction of the head, a linear combination of these source signals corrupted with noise will be observed at each measurement location, i.e. electrode. The additive noise might have encephalic or non-encephalic sources, and is generally non-stationary. However, for short EEG segments, it is often assumed to be stationary in the wide sense [21]. No statistical knowledge about the noise is additionally assumed here. Formally, we state that in response to flickering light with driving frequency f, the values recorded over time at each electrode i can be written as With conformable transformation of the amplitude values a i h , , the above notation can be rewritten for a collection of N samples from all N y electrodes (see figure 1) in matrix form as Here , where Q xx and Q xy respectively denote the sample spatial autocovariance matrix of the pure SSVEP response and the sample crosscovariance between the acquired EEG data and the assumed source model, both are estimated from N observations. Despite the fact that the source signals are defined deterministically, they can be modeled as stochastic with empirical means computed on segments of length N, which can be approximated with the zero-vector 0 for large N. Due to the orthogonality of the basis vectors in X, the sample covariance matrix Q xx , for centered and normalized X, can also be approximated for relatively large N with I N x , where = N N 2 x h . In this formulation, the number of harmonics in the SSVEP response, i.e. N h , is not a random variable but rather an unknown deterministic value.
Where it is necessary, in order to avoid ambiguity when we refer to the source model for the different driving frequencies, the subscript f will be added to matrix X f to indicate the source model of the driving frequency under consideration.

Stimulus identification during SSVEP-based interaction
During concurrent repetitive visual stimulation, an SSVEP detector aims at finding the stimulus, at which the user is attending, based on multi-channel EEG data segment  ÎÝ N N y , obtained online from continuous scalp EEG data by means of buffering (with buffer length N and buffer ). Based on the requirements of the application at hand, the buffer length and the interaction temporal resolution (T) are determined. T is defined as the shortest time (in samples) between two consecutive commands.
The overlap can be computed from  when the user does not attend to any of the stimuli. Often, ( ) g Y is defined as the argument which maximizes a score function or a test statistic s. SSVEP detection can be formally written asˆˆ( , . 5 We denote by ( ) g Y the ground truth frequency of the stimulus, to which the user attends while Y is being acquired. The score ( ) s Y X , f 0 is considered here as to test whether or not a given response is statistically significant and not due to noise fluctuations and background EEG. More often than not, it is defined as a constant threshold which is either computed from the EEG data segments themselves or a priori computed from training data. For completeness, let ( )  Î +ś M 1 1 denote the vector containing the score value for all frequencies plus the idle state. Figure 2 depicts the schematic of SSVEP detection in continuous EEG data. Available scoring functions used in(5) will be discussed in section 3.

Evaluation
The different detection methods, which will be discussed later, are compared with regard to their average accuracy (P D ), which is typically computed from labeled EEG segments as the ratio of the correctly classified segments to the total number of available segments. Average misclassification error can be easily computed with= -E P 1 m D .

Challenges in SSVEP-based interaction
The user in SSVEP-mediated applications continuously shifts his/her gaze between the M spatially distributed stimuli for active control and towards the stimuli-free areas of the display for the idle control state. Obviously, such interaction is asynchronous and completely paced with the user actions. However, and regardless of the SSVEP detection method used, recognition of these asynchronous spatial attention shifts does not happen usually on the spot due to inherent limiting factors of the buffering stage. The shorter the buffer size (small N and T), the faster is the response of the system. With larger buffer sizes, however, the temporal random fluctuations in the score function ( ) s Y X , f are made less severe as the noise (the assumed source of variability in the evoked potentials) attenuates typically in proportion to the square root of the number of time averages done on the data. On the other hand, larger buffer sizes introduce delays into the system and reduce the achievable bit rate. Consequently, finding a trade-off between interaction accuracy and speed is of high importance for practical systems.

SSVEP detection methods
In the next sections, we will provide the theoretical background for the spatial filtering methods used in state-of-the-art SSVEP detection and show in which ways they differ and how the spatial filters obtained are related to each other. Additionally, we propose a new detection method that performs superiorly in different SNR regimes as it combines CCA spatial filtering and autoregressive spectral analysis to estimate the signal and noise levels at all possible driving frequencies.

Canonical correlation analysis
Lin et al [16] used CCA to recognize the narrow-band driving frequency of SSVEPs from EEG data. The CCA-based method was found to outperform the FFT-based spectrum estimation method in terms of classification accuracy. This result has been repeatedly reported in [8,22]. The superior performance can be attributed to the ability of CCA to reveal spatial coherence in data contaminated by either white Gaussian noise or colored noise fields, should the data have high SNR [23].
CCA [24] does that by finding the maximally correlated pairs among all possible linear combinations of two zeromean multivariate random variables X and Y, where  Îx Without loss of generality, we assume in the following that  N N x y . Formally, we look for the canonical weight vectors w x and w y where = x w x x T and = y w y y T , such that the correlation coefficient between the canonical variates x and ( ) r y x y , , 1 is maximized. By definition Since scaling of w x and w y does not affect the objective function, the search space is limited by constraining the variance of the variates x and y to be 1 [25]. This leads to the By introducing Lagrange multipliers, one can easily obtain the following generalized eigenvalue problems Due to the fact that C xx and C yy denote covariance matrices, which are symmetric positive semi-definite,(7) and(8) can be rearranged into two standard symmetric eigenvalue problems and similarly ¢ = w C w y y y y 1 2 . The matrix T is referred to as the coherence matrix and denotes the cross-covariance between the whitened vectors -C x xx 1 2 and -C y yy 1 2 . This yields that . The matrix product TT T is often referred to as squared coherence matrix [26]. Other uncorrelated canonical variates can be found using the remaining eigenvectors and eigenvalues [25].
For later use, we define the decomposition x . The score function in(5) can thus be defined for the standard CCA method as r = s cca 1 . Alternatively, other score functions can be derived as an arbitrary function of ( ) Since true covariance matrices are not known a priori, the coherence matrix is defined with the empirical estimates thereof. Recall that » Q I xx N x . Given enough samples for Y, i.e.  N N y , then Q yy will be full rank and invertible. Equivalently, the canonical correlations [27] can be found by first applying the QR decomposition of = Y Q R y y and = X Q R x x , to obtain the orthonormal matrices Q x and Q y and the full rank R x and R y matrices. The second step involves the singular value decomposition of Q Q This formulation is computationally more efficient and provides more insights on the geometric interpretation of CCA. Hereby, the canonical correlations correspond to the cosine of the principal angles between the two subspaces spanned by the column spaces of Q x and Q y or formally, . Geometrically, the maximum canonical correlation is the cosine of the smallest angle possible between any two vectors in the subspaces spanned by Q x and Q y . Figure 3 illustrates this relation.

Multivariate synchronization index
More recently, Zhang et al [18] introduced the MSI for online SSVEP detection, where the synchronization level between the source model and the acquired EEG is measured based on the S-estimator [28]. The joint covariance matrix C X Y , which includes the auto and cross-covariance matrices of X and Y can be written in block form as The transform U that orthogonalizes the diagonal block matrices, i.e. whitens the original data matrices X and Y, was , and the normalized eigenvalues to be defined as l ¢ = l i P i . The synchronization index then can be obtained from the entropy-like quantity The MSI-based score is tightly related to s cca , as the eigenvalues of R are nothing more than a simple transformation of the canonical correlations, such that l r = + 1 , . This renders s msi as a mere nonlinear function of all canonical correlations, and as special case of the s fcca . Additionally, the filtering step (i.e. whitening) involved is similar to that in the CCA method.

Minimum energy combination (MEC)
Friman et al [15] proposed to apply the spatial filter  ÎẂ N N MEC y s , which minimizes the noise energy in The noise here is defined as the difference between the original EEG signal and its best LS approximation in the subspace spanned by the SSVEP sinusoids. The noise is thus estimated as 2 estimates the signal power andŝ kl provides an estimate to the noise power at the k-th harmonic in the l-th spatially filtered signal.ŝ kl is obtained by fitting a p-order autoregressive AR(p) model, with parameters {ˆˆˆ} a a s ,..., , where =j F 1 , s is the sampling frequency and f is the stimulation frequency.
The discrimination power of the statistic in(15) stems from its ability to incorporate the noise power estimate at the frequencies under consideration. So far, the score functions in CCA and MSI reflected the signal power only. Furthermore, we can write the noise covariance matrix as

Maximum contrast combination
The goal of the MCC method is to find the linear spatial filter that maximizes the generalized Rayleigh quotient [15,17]. Formally, we are after w which maximizes l = with respect to the Euclidean norm, and picking the one corresponding to r max . This proves that MCC and the CCA methods have exactly the same discrimination power since the function ( ) r f mcc 1 is monotonically increasing in r 1 .

Canonical variates with autoregressive spectral estimation of noise (CVARS)
As stated earlier, the standard CCA method is able to reveal spatial coherence in high SNR regimes. When  SNR 1, however, the canonical coefficients mainly reflect the correlation between the noise and the assumed source signals, which often leads to erroneous detection as the separate contribution of the signal and the noise to the total values of r i cannot be determined.
Therefore, we propose to estimate the noise at each frequency for each signal after spatial filtering with parametric spectral density estimation. That is, the noise power is estimated by fitting an autoregressive model to canonical variates after cleaning them from all energy at the SSVEP driving frequencies and their higher harmonics. The test statistic s cvars is computed exactly as in (15) with the AR(p) models fitted on˜( ) = --S S X X X X S , where w kl,1 and w kl,2 denote the respective weight of the sine and cosine signals of the k-th harmonic in the l-th canonical variate.

Discussion
In the light of the previous analysis and findings, it is obvious that the scores of the CCA, MCC and MSI methods correspond to different functions of the canonical correlations (r i ), and their scores can all be considered special cases of s fcca . Common to all of them is the spatial pre-whitening step of both, the EEG signals and the assumed SSVEP pure response, which is spatially white by construction. Though it is not necessary to do so to obtain the scoring functions s cca and s mcc , these methods additionally involve applying the transform -C yy 1 2 to the left singular vectors of the coherence matrix in order to obtain the optimal spatial filter. The spatial filtering of the MEC and CVARS methods is accomplished with the eigenvectors of the noise covariance matrix and the canonical weighting matrix W y , respectively. Additionally, both involve spectral analysis of AR(p) models fitted on the spatially filtered data cleaned from the energy at the driving frequencies of the SSVEP. The rationale behind using3.3 in [15,19] is that the temporally colored noise in each columns l in the matrixS, can be modeled as discrete-time autoregressive random process of order p, u n is white Gaussian noise of power s 2 and a = 1 0 . As a result, this modeling allows to whiten the temporally colored noise, and produce an unbiased (or with small bias [19]) estimate for the noise power at each stimulation frequency and its higher harmonics. The CVARS method, therefore, involves whitening the data, both spatially and temporally, before it can provide the scoring function s cvars . If the data is spatially pre-whitened before applying the MEC method, then results will be very similar to those of the CVARS method, but not exactly the same as N s used to compute s mec is governed by a fraction of the total energy in the noise signal and in case of CVARS, = N N 2 s h is fixed.

Material and methods
In order to compare the different detection methods, several experiments with volunteer subjects were conducted.

Subjects
A total of 10 healthy adults (1 female) aged 29.3±5.5 (range 22-39) with normal or corrected-to-normal vision served as paid volunteer subjects in this study. During the experiments, the participants were seated 0.65 m away from an liquidcrystal display (LCD) monitor on a comfortable armchair in a slightly dimmed room. All participants gave their written informed consent. Participants were additionally asked to fill in pre-and post-questionnaires, that were meant to collect data about the level of tiredness before and after the experiment in addition to some demographical data.
Scalp EEG signals were recorded from 16 electrodes positioned according to the international extended 10/20 electrode system over the parieto-occipital scalp areas at P3, Pz, P4, PO9, PO7, PO3, POz, PO4, PO8, PO10, O9, O1,Oz, O2, O10 and Iz. Electrodes were referenced to the right earlobe and the ground electrode was positioned at FPz. The signals were acquired with sampling rate of 512 Hz using g. USBamp acquisition system (g.tec medical engineering GmbH, Schiedlberg, Austria) and band-pass filtered at -0.5 60 Hz. The power line interference at 50 Hz was removed with a 4 th order butterworth notch-filter with 48-52 Hz stop band. All electrodes were filled with highly conductive gel in order to reduce impedance.
This study is part of a larger project, which is approved by the Ethics Committee of the Faculty of Medicine of the Technical University of Munich (TUM).

Experimental paradigm
A  22 LCD monitor 3 with 60 Hz refresh rate and 1280×1024 resolution was used to view the stimuli which consisted of four spatially distributed flickering rectangles presented simultaneously to participants. The driving 3 In addition to the LCD monitor, participants viewed stimuli via a headmounted display (HMD) from Oculus VR, USA. The HMD has 60 Hz refresh rate as well. Each subject sequentially viewed the stimuli on both displays (i.e. the monitor and the HMD) with either their left eye only, right eye only, or both eyes. This resulted in a total of 2×3 different viewing conditions which were pseudo-randomized across subjects. Subjects were assigned either to finish all the HMD or the LCD conditions at first to minimize electrode displacement that might take place after mounting/ unmounting the HMD. Two consecutive sessions per condition were recorded for later analysis. In total, each subject underwent 12 sessions, each of around 5min duration. The whole experiment lasted around two hours including preparations and rest breaks between sessions. In order to provide results that can be compared with those in the literature, we will base our evaluation throughout this work solely on the two recording sessions obtained with binocular LCD viewing. frequencies of the stimuli were chosen as integer divisors of the display refresh rate, namely 15, 12, 10 and 8.57 Hz. The chosen and fixed driving frequencies are known to evoke moderate to high SSVEP's amplitude strength [29]. The spatial distribution of the stimuli is shown in figure 4. The EEG acquisition and visual stimulation were running on two different computers, and synchronized with the screen overlay control interface [30,31].
Two recording sessions were performed, each started with a blank screen for around 15 s followed by the presentation of the flickering stimuli. During stimulation, participants were instructed to overtly sustain the spatial attention on the cued stimulus. Stimuli were highlighted in turn with a green rectangle as shown in figure 4 for 10 s followed by a rest period of 3.5 s, in which the screen went blank. The stimuli were cued in the descending order of their driving frequencies (i.e. the sequence will be top, left, right and bottom according to the stimuli constellation shown in figure 4). Detailed timing of one complete sequence is shown in figure 5. Each recording session consisted of five such full sequences.

Experimental results
The accuracy for each of the different detection methods is highly influenced by the choice of the key system parameters, e. g N T N , , h and F s . In the following, we will firstly highlight the effect of each of these parameters individually on unsupervised CCA detection accuracy. We chose the CCA method here as the other scoring functions can be obtained from the canonical coefficients, canonical variates and the canonical weights, and thus it can serve as an indicator of the information gain/loss that accompanies parameter change. By fixing these parameters in the light of the empirical evaluation of CCA, the unsupervised detection accuracy of the CVARS is then compared to the state-of-the-art methods.    Figure 6 shows the average misclassification error (E m ) for all subjects as a function of the buffer size, when the CCA method was used with N h =1 and = F 512 Hz s . With larger buffer sizes, one can observe that misclassification errors for all subjects get suppressed due to enhanced SNR and more accurate estimates of Q yy and Q xy . Figure 7(a) shows the average misclassification error of the CCA method (averaged over all subjects) for different number of harmonics N h that ranges from 1 to 6 as a function of buffer length. For > N 3 h , no further improvement is observed, which cannot be explained by the fact that EEG data itself was lowpass-filtered with a cutoff frequency 60 Hz since the fourth harmonic of the maximum stimulation frequency is not rejected thereby. However, the fourth harmonic of the driving frequency = f 12 Hz 2 , lies within the stop band of the notch filter. In order to fully isolate the influence of the notch filter, we reevaluated the average misclassification error of the unsupervised CCA method (excluding f 2 from the analysis). Again, the results shown in figure 7(b), suggest that no further improvement for > N 3 h . The increased accuracy for > N 1 h is however not statistically significant as it was also reported in [22]. The value of N h is set to 3 throughout the remaining of this work.
As has been mentioned earlier, one can derive arbitrary scoring functions s fcca from the canonical correlations r i . The effect of changing the sampling rate is shown in figure 9. Downsampling with a factor of 2 or 4, which respectively resembles sampling frequency of 256 and 128 Hz leads to accuracies that are comparable with the full data segments (with F s =512). Estimation of covariance matrices Q yy and Q xy is not, therefore, significantly affected by downsampling. This behavior suggests that adjacent samples in Y are correlated (and they are) and that allowing for Dt between samples that is larger than the expected maximum correlation lag would not affect the obtained results. Going to downsampling factor of 8 ( F s =64) deteriorates the performance significantly, as this would introduce aliasing and loss of information of the higher harmonics (with frequencies larger than F 2 s The estimates of the detection accuracies obtained so far using the non-overlapping data segments can be a bit misleading as classification is performed only on homogeneous EEG data segments, during which subjects attended to one single driving frequency. However, during online usage, EEG data collected in one segment can reflect two or more different states of user attention (e.g. user shifts his/her attention from one stimulus to another or from the idle state to one of the active states within the segment). This behavior becomes more probable with larger buffer sizes. To simulate this case, maximally overlapping data segments (i.e. T = 1 sample) were continuously extracted from the data and used to plot the CCA score evolution of the different stimulation frequencies over time. The resulting segments thus contain both homogeneous and heterogeneous data. In order to additionally provide more insights about the inter-subject variability in figure 6, the score evolution will be shown for the subjects S5 and S2, whose CCA results were the best and the worst, respectively. CCA score evolution during the first full sequence (after viewing all stimulation frequencies once) for ( ) ( ) = N T , 1024, 1 samples is shown in the upper row of figure 13. These plots show, to some extent, that during stimulation with f, the score ( ) s Y X , f increases over time and surpasses the scores of the other frequencies. Additionally, among the used stimulation frequencies, one can observe for each subject, that a specific frequency is somewhat dominant throughout the whole sequence (12 Hz for S5 and 10 Hz for S2), and to this specific frequency most of the faulty detections and false alarms can be attributed. While high scores for the dominant frequency (most likely due to interference from the alpha brain band waves, within which a peak can be  observed in most subjects' EEG [32]) starts to appear during the idle state, they get suppressed (though not always) when subjects shift their visual attention to flickering light. Furthermore, one can categorize the SSVEPs of the two subjects into high and low SNR with respect to the obtained CCA score values, where S2 is the one with the low SNR.

Distribution of canonical correlations
In the following, we will refer to the canonical correlation values obtained when users attended to a specific stimulus with frequency f as the target canonical correlations (or target scores) for that frequency. Nontarget scores of a stimulation frequency f, on the other hand, refer to the values obtained when the user attended to other or no stimuli. By fixing N h =3 and ( ) ( ) = N T , 1024, 1 samples, we computed the target and nontarget correlation coefficients for all stimulation frequencies in all recording sessions for all subjects. We assigned a data segment to a frequency f, if the most recent sample in the buffer was obtained when the corresponding stimulus was then cued for viewing.
From the histogram of all these values, we estimated the distribution of all target and nontarget canonical correlations which are shown in figure 10. These plots show that the difference between the distributions of the target and nontarget canonical correlations r i is most pronounced for r 1 . Most importantly, we could see that the typical means (for all canonical correlations) per stimulation frequency differ significantly, in a way that reflects the general power density of EEG data which, similar to pink noise, exhibits a characteristic 1/f profile [32,33]. The mean and standard deviation of all target and nontarget canonical correlations are shown in figure 11 as a function of stimulation frequency f. Besides the 1/f profile, we can observe a peak at 10 Hz, which stems from interference of the dominant alpha brain waves at the same frequency. The distributions of the target and nontarget r i ʼs clearly justify why s fcca defined as the sum of canonical correlations did deteriorate detection performance when compared to s cca , as the bias towards the low frequencies and the subject-dependent peaks in the alpha band, increases by  adding further correlations to the value of r 1 . Therefore, the CCA scores need to be scaled differently for each frequency in order to correct for the observed bias. This scaling is done efficiently in the CVARS method.

Comparison of the different methods
The same non-overlapping EEG segments from section 5.1 were used to compare the CCA, MCC, MSI, MEC and CVARS methods. The mean misclassification error averaged over all subjects is shown in figure 12 for N h =3 and = F 512 Hz s . MEC and CVARS were used with AR(7) model. By visual inspection, the results of the CCA, MCC and MSI methods do not differ significantly. The MEC and CVARS methods outperform CCA for all buffer lengths, except for 0.5 s buffers, in which case, all methods have comparable accuracies. The CVARS method performs slightly better than the MEC for almost all buffer lengths.
Furthermore, the upper three rows in figure 13 compare the CCA, MEC and CVARS with respect to their score evolution during the first full stimulation sequence with Post hoc tests using the Bonferroni correction revealed statistically significant ( ) = p 0.005 improvement of unsupervised CVARS (mean ± std: 0.739 ± 0.099) over unsupervised CCA (mean ± std: 0.64 ± 0.137). Furthermore, we found a trend for unsupervised CVARS being superior over unsupervised standard MEC, which given the current number of subjects though is not significant ( ) = p 0.069 . Unsupervised standard MEC (mean ± std: 0.72 ± 0.107) performed significantly better than the unsupervised CCA ( ) = p 0.003 . The effect of choosing longer buffers (N = 2560) with CVARS method is shown in the lower panel of figure 13. We can easily observe that fluctuations in the score function are reduced for both S2 and S5, with the side effect of introducing delays into the system (rise-up and decaying delays). This, however, did not guarantee reliable SSVEP detection for S2.

Supervised CVARS method for SSVEP detection
The unsupervised detection methods discussed so far provided estimates of the driving frequency in a WTA fashion. This way, the system is always in the control loop and the idle state is never reached. Furthermore, the target and nontarget scores distributions in figures 10 and 13 suggest that the simple argmax function in(5) will provide faulty detections due to the large visible difference in the mean scores for the different driving frequencies, and the resulting bias towards the lower frequencies. Learning a simple threshold from labeled data (obtained from a training session for each user), although can reduce the rate of false alarms and faulty detections, can lead also to a larger rate of misses as can easily be seen for CCA from the plots in figure 10 and for all methods in figure 13. For instance setting a CCA threshold of 0.5 in figure 10 would result in a high probability of miss, as can be computed from the area under the curves of the target distributions for r < 0.5 1 . It can be additionally observed that it is more probable to miss the detection of higher frequencies than the lower ones. Linear discriminant analysis (LDA) lends itself naturally to such a problem, where labeled score vectors are obtained from one training session and an LDA classifier is thereby learned. In online sessions, the scores are computed as usual to produce the vector  Îś M 1 . The LDA classifier is then applied on s to provide the final estimate of the hidden attended stimulus. We summarize the comparison between the unsupervised and supervised CVARS with ( ) ( ) = N T , 1024, 128 in the form of the confusion matrices shown in figure 14, for subjects S5 and S2, in addition to the average over all subjects. We chose CVARS here as it provided the best results in the unsupervised case. The confusion matrices were computed with each method applied on the two available recording sessions. In case of the supervised CVARS, a classifier was learned from each session and applied on the other one, and results therefrom were averaged (i.e. a total of two classifiers were used per subject). These results clearly show a reduction in the rate of false alarms on the expense of higher probability of misses. The probability of miss and that of correct detections are quite uniform with respect to the stimulation frequencies, on the contrary to what is expected if a single threshold were used. Furthermore, a large portion of the probability of miss in the supervised CVARS replaces some of the wrong detections in the unsupervised case. Falsely detecting one control state as the idle state is generally favored over confusing it with another control state.  . The zero level corresponds to the idle state. The fourth level (i.e. for f 4 ) is scaled to the largest score obtained for better visibility.

Discussion
From the results in the previous sections, we can see that there are conflicting factors that affect the detection accuracy of SSVEP. On one hand, accuracy is a monotonically increasing function of the buffer size (i.e. N), should the buffer contain only homogeneous data, which is not the most probable case in online applications. On the other hand, larger buffer sizes introduce delays, as the old samples which contain no information about the currently attended stimulus still inhabit the buffer Y and contribute to the different values ( ) s Y X , f , leading to false alarms and faulty detections.
Comparing figures 6 and 15, which respectively show the misclassification results of non-overlapping segments as a function of buffer length for CCA (N h =1) and CVARS (N h =3), we can see that buffer lengths, at which acceptable accuracies are obtained, differ from one subject to another, regardless of the detection method used. Therefore, a trade-off between accuracy and speed should be optimized for each subject, based on a short training session. The same session is also used to learn the LDA classifier from the obtained scores.
Additionally, with our theoretical analysis and empirical results, we have shown that CCA and MCC give exactly the same estimates and thus the same misclassification error rates. MSI and CCA have shown similar misclassification error rates, suggesting that r 1 plays a major role in calculating s msi . This can be seen by rewriting (13) as approximates ( ) r f relatively well when r < 0.5. Therefore, for the case when all r < 0.5 i , the MSI score function (and its quadratic approximation) gives more weight to greater r i ʼs. On the other hand, when r > 0.5 1 , the contribution of r 1 becomes more emphasized as the difference Figure 14. Confusion matrices for the unsupervised and supervised CVARS methods. The supervised CVARS method produces reliable results for all participants including S2, whose data has been shown to have low SNR. This proves the ability of supervised CVARS to deal with wide range of SNR levels.  qualitatively shows the relatively large contribution of the maximum canonical correlation in the final MSI score, especially when r > 0.5 1 grows very rapidly. The latter case is the more probable case given the empirical values of target r i ʼs shown in figure 10.
Furthermore, we have shown that MEC outperforms CCA. This result differs from what was shown in [34]. However, the score function s mec used there was defined witĥ s = 1 kl , which ignores the noise power at the stimulation frequencies and consequently MEC scores will be biased towards the low frequencies. This reaffirms the need to scale the different correlations with regard to estimates about the noise power, which is done efficiently with test statistics used in standard MEC and CVARS procedures. In [15], a test statistic similar to s mec is used with the MCC spatial filter, where the spatial filter is obtained from a subset of the columns in W y . Again, this statistic should give similar results to the CVARS when the data is pre-whitened before applying the MCC filter.
Alternatively, Yin et al [35] proposed the supervised CCA-RV method for the same purpose of reducing the variability in the final target scores of the different stimulation frequencies where s cca rv was computed with is the mean nontarget scores of a frequency f computed from a training session. Since the CCA-RV method was published very recently, we could not fully and fairly compare it to the CVARS method, especially also that real-time biofeedback mechanisms were employed in [35]. However, we claim that the supervised CVARS method described in this paper should outperform the supervised CCA-RV (when we ignore the effect of the biofeedback). Firstly, CVARS has been shown in the current paper to outperform CCA when results for all subjects were averaged and therefore plugging the CVARS scores instead of CCA in(21) is expected to provide, on average, better results. Additionally, LDA learns from the available training session the optimal mapping from the CVARS scores to the stimulation frequencies as it takes into account not only the mean values as it is the case in [35] but also the variances and covariations of the individual scores.
Throughout this work, we had a number of channels N y =16, which was larger than the number of signals assumed in the source model = = N N 2 6 x h . Reversing this relation does not affect the obtained results. For the CCA method, the number of canonical correlations is upper bounded by ( ) N N min , x y , which will be N y in this case. The eigenvalues in the MSI method, will be defined the same way as in section 3.2 except that there are -N N x y eigenvalues which have the value 1. The CVARS method, in this case, will provide no dimension reduction, as the number of canonical variates will be the same as the original EEG channels. On the contrary, the MEC method can always produce less output channels than available in the original EEG signals. Friman et al [15]  makes the CVARS method more consistent with the assumptions about the the number of source model signals.

Conclusion
Detection of SSVEP in continuous EEG signals lies under the general problem of detecting sinusoids in noisy measurements, a problem that has been thoroughly investigated in the array signal processing field. We have theoretically shown the conditions in which state-of-the-art SSVEP detection methods share similar spatial filters, a step required to enhance the overall SNR. The equivalence of the discrimination power of the MCC and CCA methods has been proven and it was conjectured that MSI should have very similar results as well. Empirical evaluations supported these results. The methods CCA, MCC and MSI rely on a single metric that is computed from the canonical correlations r i to provide an estimate about the stimulation frequency, to which a user is attending. Thereby they fail to provide reliable estimates when the signal is lost in the noise floor. The MEC and the hereby proposed CVARS methods, in addition to the step of spatial filtering, base their discrimination upon the estimated signal and noise powers at each considered frequency (the fundamental stimulation frequency and its higher harmonics). We have shown that the CVARS and the MEC scores are the same, given that the EEG signal is spatially whitened before running MEC algorithm and N s is rather fixed to ( ) N N min , 2 y h , i.e. number of canonical correlations. The CVARS method slightly outperformed the standard MEC method, which typically requires no spatial pre-whitening as reported in [15]. Finally, we have shown that the supervised CVARS method based on a short training session can be used to learn a mapping function rather than the maximizer (argmax) that estimates the hidden driving frequency and the idle state from the obtained scores, reliably and accurately. The training session should also serve the purpose of finding the optimal buffer size for a specific subject to be used in online applications.