For work by Andrew and Nima on STRF-based saliency and background models.

The basic strategy is to build a model of the preceding audio (here, the previous 2 s) and compare the current audio (0.25 s) against this model to detect deviations. The model is simply a mean and standard deviation of the cortical representation (Mesgarani, Shamma) for each frequency, rate, and scale. Deviations are quantified as the distance between the mean of the current audio and the mean of the existing model, normalized by the standard deviation of the existing model. Further work in this project will develop more complicated, higher-order models that may more robustly capture complicated background noises and audio textures.
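A minimal sketch of this scheme, assuming the cortical (or cochleagram) features have been flattened to one channel per frequency-rate-scale combination; the window lengths, the small variance floor, and averaging the normalized deviation over channels are assumptions, since the exact aggregation is not specified here.

```python
import numpy as np

def saliency_deviation(features, frame_len, bg_len):
    """Deviation of each 'current' window from a sliding background model.

    features  : (n_frames, n_channels) array -- one row per analysis frame,
                one column per frequency-rate-scale channel (assumed layout).
    frame_len : frames in the current window (e.g. 0.25 s worth).
    bg_len    : frames in the background window (e.g. 2 s worth).
    """
    n_frames = features.shape[0]
    scores = np.zeros(n_frames)
    for t in range(bg_len + frame_len, n_frames + 1):
        bg = features[t - bg_len - frame_len : t - frame_len]  # background data
        cur = features[t - frame_len : t]                      # current audio
        mu, sigma = bg.mean(axis=0), bg.std(axis=0) + 1e-8     # per-channel model
        # Distance of means, normalized by the background std,
        # then averaged over channels (the aggregation is an assumption).
        scores[t - 1] = np.mean(np.abs(cur.mean(axis=0) - mu) / sigma)
    return scores
```

A frame whose statistics match the background scores near zero, while a novel onset produces a large normalized deviation.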

Figure 1. - Spectrogram of an auditory scene, consisting of a boat motor and the sentence "She had your dark suit in greasy wash water all year" starting at about 6 s.

Figure 2. - Salience results using the cortical model, and using the same processing applied directly to the cochleagram (spectrogram). When the target is relatively loud, both representations can pull out the novel event. Here, the peak SNR, computed over 16 windows with 50% overlap spanning the target, is roughly -6 dB.
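The peak SNR measure used above can be sketched as follows; the exact window placement and length are assumptions, with the 50% overlap implying a hop of half a window.

```python
import numpy as np

def peak_snr_db(target, noise, n_windows=16):
    """Peak SNR: the best local SNR over n_windows 50%-overlapping
    windows spanning the target (windowing details are assumptions)."""
    hop = len(target) // (n_windows + 1)  # 50% overlap: hop = half a window
    win = 2 * hop
    noise_power = np.mean(noise ** 2)
    window_snrs = [np.mean(target[k * hop : k * hop + win] ** 2) / noise_power
                   for k in range(n_windows)]
    return 10.0 * np.log10(max(window_snrs))
```

Taking the maximum over windows reports the SNR at the target's loudest moment, which is more relevant for onset detection than a whole-utterance average.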

Figure 3. - Salience results when the target is attenuated by an additional 9 dB, for a peak SNR of -15 dB. The spectrogram representation can no longer reliably pull out the novel event, whereas the cortical representation still produces a clear peak over the target onset.

Figure 4. - Results using Kayser's (2005) auditory saliency map. The target is buried in the background, and the novelty is not detected.

Saliency map using cortical model

Figure 5. - Reconstruction of the audio spectrogram after thresholding the cortical representation by the magnitude of the saliency measure; this is effectively another form of "salience map." A few background events get through at this threshold level, but the target onset is well represented.
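The thresholding step might be sketched as below; the quantile-based threshold and frame-wise masking are assumptions, and the actual reconstruction back to a spectrogram would apply the inverse cortical transform to the masked coefficients.

```python
import numpy as np

def threshold_by_salience(cortical, salience, quantile=0.9):
    """Keep only the frames whose saliency exceeds a quantile threshold;
    everything else is zeroed before reconstruction."""
    thr = np.quantile(salience, quantile)
    mask = (salience >= thr).astype(cortical.dtype)
    return cortical * mask[:, None]  # broadcast mask over channels
```

Raising the quantile suppresses more of the background at the risk of clipping the edges of the target event.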


Nima Mesgarani, Andrew Schwartz