Personnel: Ed Lalor, Inyong Choi, James Wright, Jonathan Brumberg, Malcolm Slaney, Nai Ding, Nima Mesgarani, Nils Peters, Barbara Shinn-Cunningham, Siddharth Rajaram, Sudarshan Ramenahalli, Adam O'Donovan, Jeffrey Pompe, Shihab Shamma
This is unpublished work in progress.
We would like to better understand how human listeners make sense of the acoustic world around them. Attending to just one of many auditory streams is an important part of this process. The goal of this project is to use EEG signals to determine which sound stream a human subject is attending to. Our scientific goal is to better understand auditory attention. Our practical goal is to make it possible for machines to benefit from human attentional judgements: a machine could use an attentional signal to complement the skills of the human. Possible applications include additional processing of the signal the subject attends to, processing of the signals the subject is ignoring, or using the attentional signal to enhance the attended signal.
Our basic experiment looks like this. A subject hears two different streams (A and B), but attends to only one of them. We record the EEG signals from the subject. We want to determine whether the subject is attending to stream A or stream B. We hope this experiment will tell us how the brain attends to one or the other signal.
This project builds upon previous results in three different domains: EEG (electroencephalography), MEG (magnetoencephalography), and ECoG (electrocorticography). In the EEG domain, forward modeling of the impact of the speech envelope on EEG has met with some success. However, we show here that the reverse modeling approach may be much more sensitive for assessing attention using EEG. In the MEG domain, it has been shown that MEG signals can be decoded to determine which of two speakers a person is attending to. Because MEG and EEG provide somewhat complementary information, it was unclear how well this would work with EEG. A similar point can be made about recent ECoG research, which showed dramatic differences in the cortical representations of attended and unattended speakers. That study showed that speech spectrograms reconstructed from ECoG responses to a speech mixture, recorded in the superior temporal gyrus, resemble only the spectrogram of the attended speaker, as if the other speaker were not present. The invasive nature of this technique, however, limits its practical application, which motivates noninvasive EEG studies. Again, the transferability of such an approach to EEG was unclear, since EEG has a lower signal-to-noise ratio and lower spatial resolution than ECoG and MEG.
This project is interesting and significant because it is, to our knowledge, a first demonstration that reliable real-time decisions can be made on how a subject is deploying their attention to speech in a multi-speaker environment. This may have ramifications for the design of future neural prosthetics such as hearing aids and cochlear implants, and for the design of non-invasive brain-computer interfaces. It is also, to our knowledge, the first demonstration that the envelope of speech can be reconstructed from EEG. While this was known from previous MEG and ECoG studies, its demonstration in EEG is important because of the ease of use, low cost, and accessibility of EEG. Furthermore, we have shown that we can exploit the multivariate nature of EEG electrode data to achieve better performance than can be achieved when using each electrode separately.
Because of concerns over the signal-to-noise ratio (SNR) of EEG responses to speech, we initially set out to assess whether it would be worthwhile to alter natural speech in order to accentuate onsets and offsets. We did this in two ways: first, by enhancing the envelope of the speech using a method that mapped a derivative of the speech envelope onto the original speech, and second, by inserting 30 ms gaps of silence into the speech stream at pseudorandom locations. Neither manipulation seemed to significantly improve our ability to reconstruct an input speech stream relative to natural speech, so we resolved to use natural speech for the rest of the project.
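The gap manipulation can be illustrated with a minimal sketch. The function name `insert_silent_gaps`, the number of gaps, and the seed are hypothetical; the text only states that the gaps were 30 ms long and placed pseudorandomly, so this is not necessarily how the stimuli were actually generated.

```python
import random


def insert_silent_gaps(samples, fs, n_gaps, gap_ms=30.0, seed=0):
    """Insert n_gaps runs of silence, each gap_ms long, at pseudorandom points.

    samples: sequence of audio samples; fs: sampling rate in Hz.
    Returns the lengthened signal and the insertion points (indices into the
    original signal). The gap count and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    gap_len = int(round(fs * gap_ms / 1000.0))
    # Distinct insertion points inside the original signal, in ascending order.
    positions = sorted(rng.sample(range(1, len(samples)), n_gaps))
    out = []
    prev = 0
    for pos in positions:
        out.extend(samples[prev:pos])   # speech up to the insertion point
        out.extend([0.0] * gap_len)     # the silent gap
        prev = pos
    out.extend(samples[prev:])          # remainder of the speech
    return out, positions
```

Because the gaps are inserted rather than overwritten, the output is longer than the input by `n_gaps` times the gap length; overwriting existing samples with zeros would be an equally plausible reading of the description.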
Subjects in our experiments listened to two different simultaneous auditory streams, and we directed them to attend to just one of the two streams. Each stream was an audio-book recording of a Jules Verne story, either "20,000 Leagues Under the Sea" or "Journey to the Center of the Earth," spoken by a male speaker. Stories were presented at the same level, and in one of three different styles: diotic (both streams summed and presented identically to both ears), dichotic (one story to each ear), or via head-related transfer functions (HRTFs, to put each story in a different spatial location, 30 degrees left or right of the midline). We also collected data on subjects listening to a single speaker (presented diotically), which was used as training data to fit the decoder.
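The three presentation styles differ only in how the two stories are routed to the two ears. The sketch below assumes both stories are equal-length sample arrays and that the caller supplies head-related impulse responses of equal length; the function names and the HRIR arguments are hypothetical and are not taken from the experiment code.

```python
import numpy as np


def present_diotic(story_a, story_b):
    """Sum both stories and send the same mixture to both ears."""
    mix = np.asarray(story_a, float) + np.asarray(story_b, float)
    return np.stack([mix, mix])  # row 0: left ear, row 1: right ear


def present_dichotic(story_a, story_b):
    """Send story A to the left ear and story B to the right ear."""
    return np.stack([np.asarray(story_a, float), np.asarray(story_b, float)])


def present_hrtf(story_a, story_b, hrir_a, hrir_b):
    """Place each story in space by filtering it with a (left, right) pair of
    head-related impulse responses. hrir_a and hrir_b are assumed inputs,
    e.g. measured for sources 30 degrees left and right of the midline."""
    a = np.asarray(story_a, float)
    b = np.asarray(story_b, float)
    left = np.convolve(a, hrir_a[0]) + np.convolve(b, hrir_b[0])
    right = np.convolve(a, hrir_a[1]) + np.convolve(b, hrir_b[1])
    return np.stack([left, right])
```

In the dichotic case each ear receives exactly one story, which may help explain why some decoders separate the streams more easily in that condition.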
We recorded our EEG signals with a BrainVision actiCHamp active electrode data acquisition system. We used a 32-channel cap with electrode locations following the standard 10/20 system. In addition, two auxiliary channels recorded the envelopes of the incoming acoustic stimuli, and we used a third auxiliary channel to simultaneously measure horizontal and vertical bipolar electrooculogram (EOG) data. Data were acquired using BrainVision's open-source, Python-based PyCorder acquisition software, and further analysis and decoding of the data were performed using MATLAB.
We studied three methods for deciding which speaker the subject was attending to. The first approach (CCA) is based on measuring the correlation between the audio streams and the EEG signals. The other two approaches estimate the attended speech from the EEG, and then measure the correlation between the reconstruction and the input audio streams to make a decision.
- Single time-lag CCA - This method finds a linear weighting of channels, defining an EEG-channel subspace that maximally correlates with the attended speech envelope at some fixed time-lag. We test all lags in the range 0 to 300 ms and fix the subspace and lag to those that maximize the correlation in the training data.
- Multiple Single-Channel Reconstruction - This method finds a linear filter that describes the relationship between a window of EEG data and the stimulus at a particular time. It is the inverse of the AESPA filtering approach described in Lalor & Foxe (2010). It solves for this filter for each EEG channel separately. While this approach lends some interpretability to the forward model, it is suboptimal in the case of stimulus reconstruction due to redundancy and a lack of optimal normalization. This method is similar to the method below, but assumes an identity covariance matrix.
- Multi-Channel, Multi-Time Reconstruction - This method is similar to the one described above, but finds a multivariate linear filter that incorporates the channel covariance structure in the least-squares estimation of the impulse response, similar to the approach described by Mesgarani & Chang (2012).
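As an illustration of the first method in the list above, the following sketch searches over lags and fits one set of channel weights per lag. It relies on the fact that, when one side of CCA is a single variable (here the envelope), the canonical weights coincide with ordinary least-squares regression weights up to scale. The function name, argument layout, and sampling-rate handling are assumptions for illustration, not the project's MATLAB code.

```python
import numpy as np


def fit_single_lag(eeg, envelope, fs, max_lag_ms=300.0):
    """Find the lag and channel weights that maximize the training correlation.

    eeg: array (n_samples, n_channels); envelope: array (n_samples,).
    Returns (best_lag_in_samples, weights, best_correlation).
    """
    eeg = np.asarray(eeg, float)
    envelope = np.asarray(envelope, float)
    max_lag = int(round(fs * max_lag_ms / 1000.0))
    best = (0, None, -np.inf)
    for lag in range(max_lag + 1):
        # Pair the EEG at time t + lag with the envelope at time t.
        x = eeg[lag:]
        y = envelope[:len(envelope) - lag]
        x = x - x.mean(axis=0)
        y = y - y.mean()
        # For a one-dimensional target, least squares gives the CCA direction.
        w, *_ = np.linalg.lstsq(x, y, rcond=None)
        r = np.corrcoef(x @ w, y)[0, 1]
        if r > best[2]:
            best = (lag, w, r)
    return best
```

At test time, the fixed lag and weights would be applied to held-out EEG, and the projected signal correlated with each candidate envelope.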
As a first step, we assessed each of the above approaches in terms of decoding single-speaker data. That is, we fit each of the above models to a subset of a dataset that was collected when the subject was listening to just one speaker. We then assessed how well we could estimate the stimulus that generated another subset of the data that was not used in fitting the models. Having confirmed that these models could all estimate novel stimuli, we applied the same approach to determining which of two simultaneously presented speech streams was attended. We generally trained our models on single-speaker data. In some cases, we instead trained them on the two-speaker data, using either the attended stream, the unattended stream, or the mixture.
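Whatever model produces the decoder output, the final decision amounts to comparing two correlations. A minimal sketch of that step follows; the names `decide_attended`, `envelope_a`, and `envelope_b` are hypothetical, and `decoder_output` stands for either a reconstructed envelope or a CCA projection of the EEG.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length signals."""
    return float(np.corrcoef(x, y)[0, 1])


def decide_attended(decoder_output, envelope_a, envelope_b):
    """Label the stream whose envelope correlates best with the decoder output.

    Returns the label ("A" or "B") together with both correlations, so the
    margin between them can be inspected.
    """
    r_a = pearson_r(decoder_output, envelope_a)
    r_b = pearson_r(decoder_output, envelope_b)
    return ("A" if r_a > r_b else "B"), r_a, r_b
```

Because only the sign of the difference between the two correlations matters, a decision can be correct even when both correlations are small in absolute terms.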
We measured the performance of these three approaches in off-line experiments.
- Single time-lag CCA - Training and testing on single-speaker data: We used CCA to estimate a projection matrix that maximizes the correlation between the stimulus (in a training experiment) and the measured EEG. In testing, we used the subspace defined by this projection matrix to measure the correlation between each candidate test speech stimulus and the recorded EEG signal. The correlation score was higher for the real stimulus in up to about 97% of cases using "enhanced" speech, i.e., speech where the onsets and offsets were artificially enhanced by mapping a filtered version of the Hilbert transform of the original speech back onto the original speech. The decoder used the full complement of channels and was trained on both the original and enhanced single-speaker data (apart from the two runs that were reserved for the test). The decoder worked better for enhanced speech than for original speech. All lags from 0 to 300 ms were tested, and the decoder seemed to work best with a time lag between 70 and 90 ms. While we could estimate which speech had been presented in each run with high accuracy, it is worth noting that the correlation value itself was low, with a typical correlation of r = 0.03. We trained a similar decoder to test our ability to determine which of two simultaneous streams was attended; this was done as above, using only the attended stream as the training data. Performance in this case varied from 65% to 80% depending on the type of stimulus presentation (dichotic presentation worked best for this method). The decoder time window in this case, however, was in the 250-300 ms range. The results reported above are for one-minute segments of speech.
- Multiple Single-Channel Reconstruction - Training and testing on single-speaker data: We estimated the stimulus that generated some EEG (that was not used in training). We compared this estimated stimulus with the actual stimulus and with another randomly chosen stimulus from the same testing session, and the correlation score was higher for the real stimulus in 99% of cases using natural (i.e., non-enhanced) speech. The decoder used the full complement of channels, independently decoding and summing the estimated response from each channel, and was trained on both the original and enhanced single-speaker data (apart from the two runs that were reserved for the test). Performance for "enhanced" speech was worse (80-90%). Again, the correlation between the actual envelope and the estimated envelope was low, with a typical correlation of r = 0.04. We then used the same single-speaker decoder with all channels to test our ability to determine which of two simultaneous streams was attended. Performance in this case varied from 65% to 80% depending on the type of stimulus presentation (diotic presentation worked best for this method). The decoder time window for all of the above tests was 0-250 ms. The results reported above are for one-minute segments of speech.
- Multi-Channel, Multi-Time Reconstruction - We conducted the same series of experiments as outlined for the previous two methods. Again, we could estimate single-speaker data with high accuracy. Notably, we could estimate the envelope with a higher correlation than with the other two methods (r ≈ 0.08). Applying this to the attention paradigm (after training on single-speaker data), we obtained an accuracy of up to 95% for dichotic speech in one-minute segments. As we shortened the amount of data used to decode, our accuracy fell almost linearly to about 65% for 10 seconds. Other presentation conditions (i.e., diotic and HRTF) were decoded with lower accuracy than dichotic. The relationship between the amount of data used and the decoding accuracy is shown below (accuracyVStime.jpg). The success of this method is probably due to the highly correlated nature of the EEG signals, which is taken into account in estimating the optimal weighting function for each electrode.
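The dependence of accuracy on segment length reported above can be measured by cutting a test recording into non-overlapping segments and scoring one decision per segment. The sketch below assumes the reconstruction and both envelopes are already time-aligned arrays at a common sampling rate; `segment_accuracy` is a hypothetical helper, not the analysis script used for the figure.

```python
import numpy as np


def segment_accuracy(reconstruction, attended, unattended, fs, seg_seconds):
    """Fraction of non-overlapping segments in which the reconstruction
    correlates better with the attended than with the unattended envelope."""
    reconstruction = np.asarray(reconstruction, float)
    attended = np.asarray(attended, float)
    unattended = np.asarray(unattended, float)
    seg = int(round(fs * seg_seconds))
    n_segs = len(reconstruction) // seg
    if n_segs == 0:
        raise ValueError("recording is shorter than one segment")
    correct = 0
    for k in range(n_segs):
        s = slice(k * seg, (k + 1) * seg)
        r_att = np.corrcoef(reconstruction[s], attended[s])[0, 1]
        r_un = np.corrcoef(reconstruction[s], unattended[s])[0, 1]
        if r_att > r_un:
            correct += 1
    return correct / n_segs
```

Calling this for a range of `seg_seconds` values (for example 10, 20, 30, and 60) gives one accuracy per duration. Shorter segments yield noisier correlation estimates, which is consistent with the drop in accuracy described in the text.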
The overall classification performance for this multi-channel, multi-time decoding is shown below.
Interestingly, a non-causal series of lags seemed to work best for this decoding approach. We used a window of data from -200 ms to +200 ms in fitting and using the decoder. Furthermore, we identified a subset of channels that appeared to give better performance. Somewhat surprisingly, these were not located over central/fronto-central sites. The chosen subset of channels is shown below (chosenChannels.png).
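A multi-channel, multi-lag decoder with a symmetric lag window can be sketched as a regularized least-squares fit on time-shifted copies of the EEG. Several details here are assumptions not stated in the text: the ridge term, the zero padding at the edges, zero-mean inputs, and the sign convention that a positive lag means EEG recorded after the stimulus sample being reconstructed.

```python
import numpy as np


def lagged_design(eeg, lags):
    """Stack shifted copies of all channels; block j holds eeg[t + lags[j]]."""
    eeg = np.asarray(eeg, float)
    n, c = eeg.shape
    X = np.zeros((n, c * len(lags)))
    for j, lag in enumerate(lags):
        shifted = np.zeros_like(eeg)     # rows with no data stay zero
        if lag > 0:
            shifted[:n - lag] = eeg[lag:]
        elif lag < 0:
            shifted[-lag:] = eeg[:n + lag]
        else:
            shifted[:] = eeg
        X[:, j * c:(j + 1) * c] = shifted
    return X


def fit_decoder(eeg, envelope, fs, t_min_ms=-200.0, t_max_ms=200.0, ridge=1.0):
    """Least-squares decoder over all channels and lags (ridge is an assumption)."""
    lags = np.arange(int(round(fs * t_min_ms / 1000.0)),
                     int(round(fs * t_max_ms / 1000.0)) + 1)
    X = lagged_design(eeg, lags)
    XtX = X.T @ X                        # channel-by-lag covariance structure
    w = np.linalg.solve(XtX + ridge * np.eye(XtX.shape[0]),
                        X.T @ np.asarray(envelope, float))
    return lags, w


def reconstruct(eeg, lags, w):
    """Apply a fitted decoder to EEG to estimate the stimulus envelope."""
    return lagged_design(eeg, lags) @ w
```

The matrix `XtX` is where the covariance between channels and lags enters the estimate, which is the property the text credits for this method's advantage.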
Two papers have been proposed that follow up directly on the work carried out during the workshop. Taking advantage of EEG data that was previously acquired in a two-speaker attention paradigm (Power et al., 2012), we aim to 1) examine how well the speech envelope representation in the EEG correlates with behavioral performance, and 2) more formally examine the relative advantages and disadvantages of the three methods outlined above (in addition to a fourth, machine learning classifier approach).
We have also discussed a number of new follow-up experiments that may be interesting to carry out using MEG (Ding & Simon, 2012).
We couldn't have done this project without the assistance of BrainVision. They loaned us an actiCHamp amplifier system with 32 active EEG electrodes. The active electrodes were great because they made it easier to set up our experiments. It was easy to modify their PyCorder software to send the EEG signals, via UDP, in real time to a MATLAB process that implemented the decoding. The product details are at: http://www.brainproducts.com/productdetails.php?id=42 Thank you!
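For readers who want to try a similar real-time setup, the receiving side of a UDP link can be sketched in a few lines. The receiver in this project was a MATLAB process, and the actual packet layout is not documented here, so everything below is hypothetical: the channel count, the little-endian float32 encoding, the absence of a header, and the port handling.

```python
import socket
import struct

N_CHANNELS = 35  # hypothetical: 32 EEG channels plus 3 auxiliary channels


def unpack_frame(payload, n_channels=N_CHANNELS):
    """Decode one datagram into a list of samples (tuples of n_channels floats).

    Assumes the payload is a headerless run of little-endian float32 values,
    one value per channel per sample; this format is an illustrative guess.
    """
    frame_bytes = 4 * n_channels
    if len(payload) % frame_bytes:
        raise ValueError("payload is not a whole number of samples")
    fmt = "<%df" % n_channels
    return [struct.unpack_from(fmt, payload, offset)
            for offset in range(0, len(payload), frame_bytes)]


def receive_samples(port, n_datagrams, host="127.0.0.1", timeout=5.0):
    """Collect samples from n_datagrams UDP packets arriving on host:port."""
    samples = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(timeout)
        for _ in range(n_datagrams):
            payload, _sender = sock.recvfrom(65535)
            samples.extend(unpack_frame(payload))
    return samples
```

UDP does not guarantee delivery or ordering, so a real system would likely add a sequence number to each packet to detect dropped or reordered data.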
Nai Ding and Jonathan Simon. Neural coding of continuous speech in auditory cortex. Journal of Neurophysiology, 108, pp. 78-89, 2012.
Edmund Lalor and John J. Foxe. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. European Journal of Neuroscience, 31, pp. 189-193, 2010.
Nima Mesgarani and Edward Chang. Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), pp. 233-236, 2012.
Attachments:
- Project Decode Overview Figure
- accuracyVStime.jpg
- chosenChannels.png