Decoding Speech vs. Music

There is some debate about whether music can be decoded from EEG. The two experiments here approach the topic from two angles. 'Speech to Music' compares within-subject decoding from actual speech and actual music signals. 'Utterance to Song' (sometimes called 'Speech to Song') compares within-subject decoding of speech signals that may be perceived as song after several repetitions.

Speech to Music

Basic Idea

This experiment was designed to find the difference between speech and music TRFs. The conditions recorded in this experiment provide examples of EEG responses to a continuum of sounds ranging from speech to music. The conditions described below differ in how much of the speech (or singing) envelope remains visible in the clean audio waveform. This parameter was varied because EEG decoding has often succeeded using the speech envelope, whereas the envelope of music may not be a relevant feature for decoding.
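For readers who want to compute the speech envelope referred to above: a common approach is the magnitude of the analytic (Hilbert) signal, low-pass filtered to the slow modulation range. A minimal sketch in Python (the function name and cutoff are ours, not from the experiment code):

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def broadband_envelope(audio, fs, cutoff_hz=8.0):
    """Magnitude of the analytic signal, low-pass filtered to the
    slow modulation range typically used for EEG decoding."""
    env = np.abs(hilbert(audio))
    sos = butter(2, cutoff_hz, fs=fs, btype="low", output="sos")
    return sosfiltfilt(sos, env)

# synthetic check: a 1 kHz tone with a 4 Hz amplitude modulation
fs = 16000
t = np.arange(fs) / fs
audio = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
env = broadband_envelope(audio, fs)  # recovers the 4 Hz modulation
```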

Summary of Conditions

The file descriptions below can be found in 'music_to_speech.mat'. The stimuli in this experiment were performed and recorded by Alana L. Graber.


1. plain speech alone - contains ALG speaking lyrics
2. rhythmic speech alone - contains ALG speaking same lyrics in correct rhythm
3. singing alone - contains ALG singing the lyrics a cappella
4. music backtrack alone - only music, no human voice
5. rhythmic speech with medium backtrack - ALG rhythmic speech combined with backtrack (the speech envelope is visible)
6. rhythmic speech with high backtrack - ALG rhythmic speech combined with backtrack (the speech envelope is not visible)
7. singing with medium backtrack - ALG singing with backtrack (the singing envelope is visible)
8. singing with high backtrack - ALG singing with backtrack (the singing envelope is not visible)
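To inspect what 'music_to_speech.mat' actually contains, scipy can list its variables; the field names inside the file are unknown to us, so check them first. A hedged sketch (the demo file and variable name here are ours):

```python
import numpy as np
from scipy.io import loadmat, savemat

def list_mat_variables(path):
    """Names of the user variables stored in a MATLAB .mat file
    (skipping the '__header__'-style metadata keys)."""
    data = loadmat(path, squeeze_me=True)
    return sorted(k for k in data if not k.startswith("__"))

# demo on a throwaway file; for the real data, point this at 'music_to_speech.mat'
savemat("demo.mat", {"start_time": np.array([6.32, 310.19])})
print(list_mat_variables("demo.mat"))
```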

Experimental Setup

Each of the 8 conditions was played twice, and the presentation order was pseudo-randomized. Triggers occurred at the onset of every condition and every minute until the end of the condition. The subject was able to take short breaks between conditions, pressing a key on the stim computer to trigger the next condition. There was no task; the subject was instructed to listen actively, stay still, and look at one spot in the distance. The experiment lasted approximately 60 minutes.

About the Data

Channel 2 (Fz) was the reference channel. Triggers were recorded on 'aux1', the last time series in the raw data. The very first trigger (shorter than all the others) should be ignored. There are 74 correct triggers, 16 of which correspond to condition onsets.
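When parsing the raw recordings, the triggers on 'aux1' can be recovered as rising edges, and the spurious short first trigger mentioned above can be rejected by a minimum pulse length. A minimal sketch (the function name, threshold, and example data are ours; it assumes the trigger channel starts and ends below threshold):

```python
import numpy as np

def trigger_onsets(aux, fs, thresh=None, min_len_s=0.01):
    """Rising-edge trigger times in seconds, dropping pulses shorter
    than min_len_s (e.g. a spurious short first trigger)."""
    if thresh is None:
        thresh = 0.5 * aux.max()
    high = aux > thresh
    edges = np.flatnonzero(np.diff(high.astype(int)))
    onsets, offsets = edges[::2], edges[1::2]   # assumes signal starts/ends low
    keep = (offsets - onsets) / fs >= min_len_s
    return onsets[keep] / fs

# synthetic example: one short spurious pulse, then two real 50 ms pulses
fs = 1000
aux = np.zeros(5 * fs)
aux[100:103] = 1.0    # 3 ms pulse -> rejected
aux[1000:1050] = 1.0  # real trigger near 1.0 s
aux[3000:3050] = 1.0  # real trigger near 3.0 s
print(trigger_onsets(aux, fs))
```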

Subject 1

For subject 1, the conditions were presented in the following order:

stimuli start_time end_time delta_time trigs_per_stim
'Meadowlark_rhythmicspeech_highaccomp.wav' 6.32040474188398 254.088033090753 247.767628348869 5
'Meadowlark_singing_alone.wav' 310.193322952371 556.861203639361 246.66788068699 5
'Meadowlark_plainspeech_alone.wav' 589.4767481296 713.357992150763 123.881244021162 3
'Meadowlark_singing_highaccomp.wav' 734.927391293473 983.253587301093 248.32619600762 5
'Meadowlark_rhythmicspeech_alone.wav' 1011.56636685073 1244.60840434152 233.042037490784 4
'Meadowlark_singing_medaccomp.wav' 1265.12259860146 1513.38382918769 248.261230586228 5
'Meadowlark_rhythmicspeech_medaccomp.wav' 1819.29732552388 2067.57324138038 248.275915856502 5
'Meadowlark_accomp_alone.wav' 2075.87399339031 2324.18779499596 248.313801605647 5
'Meadowlark_rhythmicspeech_highaccomp.wav' 2336.05146735291 2584.33299739577 248.281530042863 5
'Meadowlark_rhythmicspeech_medaccomp.wav' 2625.76334533752 2874.10811889726 248.344773559744 5
'Meadowlark_singing_medaccomp.wav' 2888.09707530301 3136.38323520184 248.286159898824 5
'Meadowlark_singing_alone.wav' 3143.09215057136 3389.74769382975 246.655543258385 5
'Meadowlark_rhythmicspeech_alone.wav' 3412.50817548567 3645.6235584872 233.115383001539 4
'Meadowlark_singing_highaccomp.wav' 3657.52446178453 3905.85879959425 248.33433780972 5
'Meadowlark_accomp_alone.wav' 3926.05284125796 4174.3629865593 248.310145301337 5
'Meadowlark_plainspeech_alone.wav' 4227.30223755199 4351.19967124828 123.897433696286 3


Four conditions were selected for analysis: plain speech, singing alone, backtrack alone, and singing with medium-level backtrack. A TRF was generated for each of the two repetitions of each selected condition. The M100 shows a polarity reversal in some cases.
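The TRFs themselves were computed with the lab's tools; purely as an illustration of the general technique (a forward model via time-lagged ridge regression from envelope to EEG), here is a sketch on synthetic data. Nothing below is from the actual pipeline:

```python
import numpy as np

def fit_trf(stim, eeg, fs, tmax=0.4, ridge=1.0):
    """Forward TRF: ridge regression predicting EEG from lagged stimulus.
    Returns (lags in seconds, TRF weights)."""
    lags = np.arange(0, int(tmax * fs) + 1)
    n = len(stim)
    X = np.zeros((n, len(lags)))
    for j, l in enumerate(lags):
        X[l:, j] = stim[:n - l]          # stimulus delayed by l samples
    w = np.linalg.solve(X.T @ X + ridge * np.eye(len(lags)), X.T @ eeg)
    return lags / fs, w

# synthetic check: "EEG" = stimulus delayed by 100 ms plus noise
rng = np.random.default_rng(0)
fs = 100
stim = rng.standard_normal(fs * 60)
eeg = np.roll(stim, int(0.1 * fs)) + 0.1 * rng.standard_normal(len(stim))
lags, w = fit_trf(stim, eeg, fs)
print(lags[np.argmax(np.abs(w))])  # peak weight near 0.1 s
```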

(Missing attachments: TRF_singing_medaccomp.png, TRF_accomp.png)

The same data were used to train four linear normalized reverse-correlation models, one per condition. All four models were then used to reconstruct each of the four conditions of interest. The correlations are summarized below.

                           Plain Speech Alone   Singing Alone   Singing with Accompaniment   Accompaniment Alone
Plain Speech Model
Singing Model
Singing with Accomp Model
Accomp Model
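The cells of the grid above would hold Pearson correlations between reconstructed and actual envelopes. As a rough sketch of how such a model-by-condition grid is filled (a backward model reconstructing the stimulus from later EEG samples, with synthetic data standing in for the real conditions; every name here is ours):

```python
import numpy as np

def advanced(x, n_lags):
    """Design matrix of x advanced by 0..n_lags-1 samples: a backward
    model reconstructs the stimulus from EEG that follows it."""
    X = np.zeros((len(x), n_lags))
    for l in range(n_lags):
        X[:len(x) - l, l] = x[l:]
    return X

def train_decoder(eeg, stim, n_lags=20, ridge=1.0):
    X = advanced(eeg, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_lags), X.T @ stim)

def reconstruction_corr(w, eeg, stim, n_lags=20):
    rec = advanced(eeg, n_lags) @ w
    return np.corrcoef(rec, stim)[0, 1]

# synthetic stand-ins for two conditions (the real grid uses the four above)
rng = np.random.default_rng(1)
envs = {c: rng.standard_normal(6000) for c in ("speech", "singing")}
eegs = {c: np.roll(e, 5) + 0.5 * rng.standard_normal(6000)
        for c, e in envs.items()}
grid = {(m, c): reconstruction_corr(train_decoder(eegs[m], envs[m]),
                                    eegs[c], envs[c])
        for m in envs for c in envs}
for (m, c), r in grid.items():
    print(f"{m} model -> {c}: r = {r:.2f}")
```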

Utterance to Song

Basic Idea

About the phenomenon of speech to song: The sounds used in this experiment are short clips of speech. Upon repetition, half of the clips seem to transform into song, i.e. the pitch and rhythm of the speech clip seem like singing. The other half of the clips do not transform, although the prosodic content of ALL the speech stimuli is comparable. The percept of song begins within 10 repetitions of a transforming stimulus. These exact stimuli were used in an fMRI study. Of the 48 available speech stimuli, only 12 transforming and 12 non-transforming stimuli were selected for this EEG study.

Summary of Conditions


Maarten (leader), Emily, Jens, Lisa, Sahar