Auditory Gisting

Participants: Clara Suied, Nima Mesgarani, Daniel Pressnitzer, Malcolm Slaney

How short can a sound be and still support recognition? Anecdotal evidence indicates that human listeners are remarkably adept at identifying complex sounds. There is, however, a surprising lack of behavioral data, obtained with psychophysical methods under acoustically controlled conditions, that could shed light on the mechanisms involved (Ballas 1993; Robinson and Patterson 1995).

Before coming to Telluride, we ran a series of psychophysical experiments using sounds gated in time to address this question (see also Agus, Suied, Thorpe, and Pressnitzer, ISCAS 2010). Gating restricts the duration of a sound by applying a shorter and shorter time window to the original signal, until the task becomes impossible. Applying such strong constraints to the sounds should, we hope, help to unravel some of the mechanisms responsible for identification. We used a large corpus of natural sounds (N=480): singing voices (5 vowels by 4 speakers) and musical instruments (20 instruments), all presented at the same loudness and at 12 different pitches for each type of sound (from A3 to G#4). The stimuli were gated in time using Hanning windows with durations of 2, 4, 8, 16, 32, 64, or 128 ms. The starting point of the gate was chosen randomly within the original sample. In each trial, listeners heard a short sound and had to indicate whether it was a voice or not; musical instruments were the distractors.
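The gating procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `gate`, the 44.1 kHz sampling rate, and the synthetic 220 Hz tone standing in for a recorded sound are all assumptions.

```python
import numpy as np

def gate(signal, fs, duration_ms, rng=None):
    """Return a Hanning-windowed excerpt of `signal` lasting `duration_ms`.

    The excerpt starts at a random position, as in the experiment description.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(fs * duration_ms / 1000.0))  # gate length in samples
    if n > len(signal):
        raise ValueError("gate is longer than the signal")
    start = rng.integers(0, len(signal) - n + 1)  # random starting point
    return signal[start:start + n] * np.hanning(n)

# Hypothetical stimulus: half a second of a 220 Hz tone (A3), assumed fs.
fs = 44100
t = np.arange(fs // 2) / fs
tone = np.sin(2 * np.pi * 220.0 * t)

rng = np.random.default_rng(0)
gated = {d: gate(tone, fs, d, rng) for d in (2, 4, 8, 16, 32, 64, 128)}
```

Each entry of `gated` holds one excerpt per gate duration; the Hanning window tapers both ends to zero, which avoids onset and offset clicks.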

Results obtained in one of the experiments are shown in Figure 1. d-prime (d') is the sensitivity index of signal detection theory. A high d' represents reliable recognition of the target (i.e. the voice). A d' of 0 indicates that performance is no better than chance.
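For reference, d' in a yes/no task is the difference between the z-transformed hit rate and false-alarm rate, d' = z(H) - z(FA). A minimal sketch is given below; the function name and the log-linear correction (adding 0.5 to each count so that rates of exactly 0 or 1 stay finite) are assumptions, since the report does not say how extreme rates were handled.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    # Log-linear correction (an assumption): keeps both rates strictly
    # between 0 and 1 so that z() is always finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)

# Equal hit and false-alarm rates give d' = 0 (chance performance).
chance = d_prime(50, 50, 50, 50)
# Many hits and few false alarms give a large positive d'.
good = d_prime(90, 10, 10, 90)
```

Here a "hit" is a voice trial answered "voice", and a "false alarm" is an instrument trial answered "voice".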


Figure 1: Psychophysical results

During the workshop, we tried to understand which acoustical features were available to the participants for recognizing such short sounds: at 16 ms, for example, performance is already very good. We tried several models.

The first method we applied is described in the schematic in Figure 2.

Figure 2: First method used to model the psychophysical data

Classification was performed with a support vector machine (SVM), either linear (L-SVM) or non-linear (NL-SVM) with a radial basis function (RBF) kernel. In addition, we either trained and tested on the same duration, or trained at 128 ms and tested on all durations.
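The two classifiers and the two training regimes can be sketched with scikit-learn as below. This is only an illustration of the comparison, not the workshop code: the features are random numbers standing in for auditory-spectrogram features, and `fake_features`, `fit_and_score`, the 16 feature dimensions, and the way class separation grows with duration are all hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fit_and_score(X_train, y_train, X_test, y_test, kernel):
    """Train an SVM with the given kernel and return test accuracy."""
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

def fake_features(duration_ms, n_per_class, rng):
    """Hypothetical stand-in for auditory-spectrogram features.

    Class separation is made to grow with duration purely for illustration.
    """
    sep = np.log2(duration_ms) / 7.0  # equals 1.0 at 128 ms
    voices = rng.normal(+sep, 1.0, size=(n_per_class, 16))
    instruments = rng.normal(-sep, 1.0, size=(n_per_class, 16))
    X = np.vstack([voices, instruments])
    y = np.r_[np.ones(n_per_class), np.zeros(n_per_class)]  # 1 = voice
    return X, y

rng = np.random.default_rng(0)
durations = (2, 4, 8, 16, 32, 64, 128)
train = {d: fake_features(d, 100, rng) for d in durations}
test = {d: fake_features(d, 100, rng) for d in durations}

results = {}
for kernel in ("linear", "rbf"):  # L-SVM and NL-SVM
    for d in durations:
        same = fit_and_score(*train[d], *test[d], kernel)        # train and test at d
        from_128 = fit_and_score(*train[128], *test[d], kernel)  # train at 128 ms
        results[(kernel, d)] = (same, from_128)
```

With real data, the accuracies in `results` would be converted to d' and plotted against the listeners' results for each duration.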

Results are shown in Figure 3.

Figure 3: Models using the auditory spectrogram

A linear SVM on the output of the auditory spectrogram does not accurately predict the data. A non-linear SVM on the auditory spectrogram fits the data almost perfectly when trained at 128 ms. The same non-linear SVM overestimates performance when trained and tested on the same duration. This suggests that participants used complex spectral features to do the task, which they extracted either from the long sounds or from long-term memory for voices.

Unfortunately, the use of a non-linear classifier makes it more difficult to understand exactly which acoustical features the model used. We therefore tried a second method (see Figure 4).

Figure 4: Second method used to model the psychophysical data

Instead of placing the complexity in the classifier stage, we placed it in the auditory representation: a cortical model of complex responses was used (Shamma, Mesgarani). Results are shown in Figure 5.

Figure 5: Models using the cortical model

With the cortical representation, a linear SVM classifies the sounds with performance similar to that of human listeners.

This is a promising result: because the classifier is now linear, we will be able to go back and examine which acoustical features were used. This is a necessary first step for future work.


Ballas JA (1993). "Common factors in the identification of an assortment of brief everyday sounds." J Exp Psychol Hum Percept Perform 19(2): 250-267.

Robinson K and Patterson RD (1995). "The stimulus duration required to identify vowels, their octave, and their pitch chroma." J Acoust Soc Am 98(4): 1858-1865.

Agus TR, Suied C, Thorpe SJ and Pressnitzer D (2010). "Characteristics of human voice processing." IEEE International Symposium on Circuits and Systems.