Extracting Speech from Nonsense - "Superstitious" Auditory Perception

Coordinator: John Foxe

Project idea

The Ghost in the Machine

The basic idea is to present noise signals to a subject, who is asked to choose which noise sample sounds most like (or contains) the desired word. By averaging the stimuli the subject selects over many trials, we expect to see an average spectrum (or other representation) that looks like the desired word. We mostly use random stimuli, but occasionally we also include the target word (a lure) in the mixture.
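
As a sketch of this averaging step (a reverse-correlation estimate over the subject's choices; the array shapes and names below are stand-ins, not the actual experiment code):

```python
import numpy as np

# Reverse-correlation sketch: average the spectrograms of the noise
# stimuli the subject chose, minus those rejected. With enough trials
# the difference should start to resemble the target word.
rng = np.random.default_rng(0)
n_trials, n_freq, n_time = 500, 64, 100

# Stand-in data: spectrograms of the two stimuli on each trial and
# the subject's choice (0 = first stimulus, 1 = second).
spec = rng.standard_normal((n_trials, 2, n_freq, n_time))
choice = rng.integers(0, 2, size=n_trials)

chosen = spec[np.arange(n_trials), choice]        # selected stimuli
rejected = spec[np.arange(n_trials), 1 - choice]  # non-selected stimuli

# The classification image: structure that drove the "yes" responses.
classification_image = chosen.mean(axis=0) - rejected.mean(axis=0)
```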

There are a number of tests we can make of this paradigm.

  • Can people do the task at all? How do we average the stimuli?
  • How does the ERP compare for the imagined and the heard words?
  • Can we find a decision variable that indicates which segment the user will choose?
  • Can we derive a decision variable from the user's data, since we know which segment is chosen?
  • Can we derive condition specific reconstruction techniques?

Related work (and papers)

Two papers use the same "bubbles" technique on speech audio. Malcolm notes: Mandel adds noise (bubble-shaped in the spectrogram domain) to simple speech sounds, and then tests subjects' ability to figure out which speech sound they are listening to:

* Michael I. Mandel. Learning an intelligibility map of individual utterances. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013.

* Michael I. Mandel, Sarah E. Yoho, and Eric W. Healy. Generalizing time-frequency importance functions across noises, talkers, and phonemes. In Proceedings of Interspeech, 2014.

Data management

  • 4 subjects were run with Shihab's noise (0–4 kHz bandpass): Ed.mat/Gio.mat/CL.mat/BH.mat
  • 3 subjects were run with Shihab's noise (0–8 kHz bandpass) during EEG recordings: 735794*.mat (2 datasets)/TN1.mat/14_07_17_Gio.mat (2 datasets)

Noise Approaches

There are a number of possible noise sources. We'll try to summarize the options here:

White noise

Least amount of speech content; purest signal. But is there enough contrast that people will identify speech sounds?
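
A minimal sketch of generating such a stimulus, assuming the bandlimited setup mentioned under Data management; the sample rate, filter order, and cutoff here are illustrative:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Bandlimited white-noise stimulus; the 4 kHz cutoff mirrors the
# 0-4 kHz band mentioned under Data management.
fs = 16000   # sample rate (Hz), illustrative
dur = 0.7    # each stimulus is about 700 ms
rng = np.random.default_rng()

noise = rng.standard_normal(int(fs * dur))
sos = butter(8, 4000, btype="lowpass", fs=fs, output="sos")
stimulus = sosfiltfilt(sos, noise)
stimulus /= np.max(np.abs(stimulus))  # normalize to +-1 for playback
```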

ICRA Noise

Starts as speech, but is modified to become unintelligible. Described here: http://informahealthcare.com/doi/pdfplus/10.3109/00206090109073110 and there are some noise examples here: http://medi.uni-oldenburg.de/download/ICRA/ and http://www.icra.nu/?page_id=58

ISTS Noise

Friends at Starkey recommend this article on randomized speech, which describes an approach called ISTS (International Speech Test Signal): http://blog.starkeypro.com/a-preferred-speech-stimulus-for-testing-hearing-aids/. But this really does sound exactly like speech.

Random dots in a spectrogram
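
One way to realize this, as a minimal sketch: scatter Gaussian blobs at random time-frequency positions in a magnitude spectrogram, attach random phase, and invert to a waveform. The dot count and blob widths below are illustrative, not the values actually used:

```python
import numpy as np
from scipy.signal import istft

# "Random dots" noise sketch: sparse Gaussian blobs in a magnitude
# spectrogram, random phase, inverse STFT back to audio.
fs = 16000
n_freq, n_frames = 257, 90  # 512-point STFT, roughly 0.7 s of frames
rng = np.random.default_rng()

mag = np.zeros((n_freq, n_frames))
f_idx = np.arange(n_freq)[:, None]
t_idx = np.arange(n_frames)[None, :]
for _ in range(30):  # 30 random dots
    f0 = rng.uniform(0, n_freq)
    t0 = rng.uniform(0, n_frames)
    mag += np.exp(-((f_idx - f0) ** 2) / 18 - ((t_idx - t0) ** 2) / 8)

phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
_, audio = istft(mag * np.exp(1j * phase), fs=fs, nperseg=512)
```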

Random dots in TORQ space (STRF)


Babble noise

Combining lots of TIMIT speech leads to something that sounds like a cocktail party. Listen to examples here: [BabbleExamples].
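
A minimal sketch of the mixing step, assuming a local TIMIT copy; the glob pattern, number of talkers, and buffer length are illustrative:

```python
import glob
import numpy as np
import soundfile as sf

# Babble sketch: overlap many utterances at random offsets so that no
# single talker remains intelligible. The directory layout here is
# hypothetical; point the pattern at your own TIMIT copy.
files = glob.glob("TIMIT/TRAIN/**/*.WAV", recursive=True)
rng = np.random.default_rng()

fs = 16000                       # TIMIT sample rate
babble = np.zeros(int(fs * 50))  # ~50 s buffer, as in the stimulus files
for path in rng.choice(files, size=100, replace=False):
    x, _ = sf.read(path)
    start = rng.integers(0, len(babble) - len(x))
    babble[start:start + len(x)] += x

babble /= np.max(np.abs(babble))  # normalize before writing/playing
sf.write("babble.wav", babble, fs)
```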

EEG Experiment Design

Three subjects (DN, TN, GD) participated in this EEG experiment. Each performed 500 forced choices between pairs of stimuli; we also collected an additional 500 and 248 trials for GD and DN, respectively. Each stimulus was randomly extracted from a ~50 s sound file generated in advance using Shihab's noise approach. To guarantee time alignment in the post-processing of the recorded data, a trigger was sent simultaneously with the start of each audio stimulus.

The subjects were instructed to minimize motor movements and to reduce eye movements by fixating a cross-hair at the centre of the screen. The stimuli were played with a random jitter, drawn from a uniform distribution over ±300 ms, applied to the start of each sound. The subject selected one of the two sounds by pressing the left or right arrow key for the first or second sound, respectively. Before each choice, he or she had the chance to blink and, if necessary, take a break. Each session of 500 trials (50 of which contained the actual word in one of the two stimuli; the order was randomised) lasted about 45-50 minutes.
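
A sketch of a single trial's presentation logic; `send_trigger` is a placeholder for whatever interface the EEG amplifier actually uses, and the stimuli here are silent stand-ins:

```python
import time
import numpy as np
import sounddevice as sd  # pip install sounddevice

fs = 16000

def send_trigger(code):
    pass  # hypothetical: write `code` to the acquisition system

def play_with_jitter(stimulus, rng):
    # Uniform delay spanning a +-300 ms window around a 300 ms nominal onset.
    time.sleep(rng.uniform(0.0, 0.6))
    send_trigger(1)        # mark stimulus onset in the EEG record
    sd.play(stimulus, fs)
    sd.wait()

rng = np.random.default_rng()
stim_a, stim_b = (np.zeros(int(fs * 0.7)) for _ in range(2))  # placeholders
play_with_jitter(stim_a, rng)
play_with_jitter(stim_b, rng)
answer = input("Which sound contained the word? [left/right] ")
```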

EEG analysis

The EEG responses collected can be classified into three categories: the activity recorded while the subject was listening to the stimulus they subsequently selected, the activity for the non-selected stimulus, and the activity elicited by stimuli that contained the actual word "superstition" (at several SNRs). A time-locked average was computed for each of these three conditions, and the distribution of the electrical activity on the scalp over time was studied. Furthermore, a fourth time course was introduced, as was done in the "faces" paper. The DSS toolbox was used to extract significant components from the data and to project back a denoised version of the EEG recording. This analysis visually improved the single-subject topographies. However, the grand average (across subjects) did not improve as much with the same algorithm, so the DSS analysis was not used for the results reported in the final presentation.
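
The condition-wise averaging can be sketched as follows, assuming the data are already epoched around the triggers; shapes and labels are stand-ins:

```python
import numpy as np

# Time-locked (trigger-aligned) averaging per condition.
n_trials, n_channels, n_samples = 1000, 64, 700  # e.g. 700 ms at 1 kHz
rng = np.random.default_rng(1)

epochs = rng.standard_normal((n_trials, n_channels, n_samples))
# 0 = selected stimulus, 1 = non-selected, 2 = contained "superstition"
condition = rng.integers(0, 3, size=n_trials)

# One channels x time ERP per condition.
erps = {label: epochs[condition == label].mean(axis=0)
        for label in (0, 1, 2)}
```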


The picture that emerges from the topographies is that, on average, neural activity differs when a stimulus is more similar to the word the subject is looking for, and that this difference appears relevant up to about 500 ms, even though each stimulus is about 700 ms long. One goal is to demonstrate that, within that time window, activity is actually stronger or different for the chosen stimuli (yes) compared to the non-chosen stimuli (no). Our hypothesis is that, after the first presentation of the template word "superstition", the subject applies the same "filter" to each of the two stimuli presented on every trial to detect whether the template is there. Since the subject was primed, the same filter is applied to every input, even when the input does not contain the magic word at all.

We can therefore estimate this filter using the ridge-regression-based mTRF algorithm, which returns the multivariate temporal response function of the system given the sequence of stimuli and the corresponding sequence of EEG recordings. The mTRF toolbox can be found at http://sourceforge.net/projects/aespa/files/. Half of the collected data was used to train the model (only trials in which neither stimulus contained the template), while the other half was used to predict the EEG given each pair of stimuli. The predictions were then compared with the actual recordings and, for each pair, the stimulus whose predicted EEG correlated best with the recorded EEG was selected. The resulting sequence of selections is an estimate of the user's choices, and it reached an accuracy above 58% for each subject, statistically above chance. The idea here is that the filter has the purpose of looking for the template word, so it may elicit stronger cortical activity when a stimulus is more similar to "superstition". In that case, the actual response to the word dominates the EEG prediction, leaving less independent noise in it and therefore yielding higher prediction correlations. This was demonstrated with 3 subjects, but further data collection is necessary to make the result statistically solid. Still, the values obtained seem promising, and they could be evidence that the hypothesized filter for the word "superstition" is actually built and used after the subject has been primed.
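
The mTRF toolbox itself is MATLAB; the forward-model fit and the correlation-based selection can be sketched in Python with plain ridge regression. The lag range, regularization value, and all data below are illustrative stand-ins:

```python
import numpy as np

def lagged(stim, lags):
    """Stack time-shifted copies of a 1-D stimulus feature."""
    X = np.zeros((len(stim), len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:len(stim) - lag]
        else:
            X[:lag, j] = stim[-lag:]
    return X

def fit_trf(stim, eeg, lags, lam=1e2):
    # Ridge regression from lagged stimulus to EEG (the mTRF idea).
    X = lagged(stim, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ eeg)

def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(2)
fs = 100                                 # feature/EEG rate (Hz)
lags = np.arange(0, 50)                  # 0-500 ms, where the effect lives
train_stim = rng.standard_normal(10000)  # stimulus envelope, stand-in
train_eeg = lagged(train_stim, lags) @ rng.standard_normal(len(lags))
w = fit_trf(train_stim, train_eeg, lags)

# Decoding one pair: predict EEG from each candidate stimulus and pick
# the one whose prediction correlates best with the recorded EEG.
stim_a, stim_b = rng.standard_normal((2, 70))
recorded = lagged(stim_a, lags) @ w + 0.5 * rng.standard_normal(70)
scores = [corr(lagged(s, lags) @ w, recorded) for s in (stim_a, stim_b)]
predicted_choice = int(np.argmax(scores))  # 0 = first, 1 = second
```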

Related work (and papers)

There are clear instances in which a perceiver has absolutely no idea of the identity of a word buried in noise, yet information is clearly still available; this can be seen when multisensory cueing is used.