Discussion Comparing Auditory Saliency Models

Description: We'll have at least two different auditory saliency models in Telluride (Ozlem and Kayser). We should compare and contrast them. Add in any new models we think of.

Expected Outcome: Either a winning model, or more likely a review article that describes their relations (and compares them to visual models).

Summary of the discussion

During the workshop, we were exposed to two auditory salience models (Kayser vs. Kalinli). At the end of the workshop, we sat down for a discussion of saliency models: How do the current models fare? What should models aim to do? Is saliency a useful concept in audition?

Background auditory clutter

A major problem we had when using auditory saliency models is that everyday situations tend to have background noise. The current implementations of auditory salience operate over relatively short time-scales (approx. 320 ms for the Kayser model). This is shorter than the typical “silences” that form the gaps in conversations; but these gaps are not truly silent: there will generally be a second conversation or other auditory clutter in the background. Auditory salience models tend to pick the most salient sound in this clutter and, as part of the normalization process, exaggerate its importance. This arguably doesn’t represent true salience.
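The normalization effect described above can be sketched numerically. Everything below (the 1-D envelope standing in for a saliency map, the max-normalization rule, the sample rate) is a hypothetical stand-in for the real models; only the ~320 ms window size is taken from the Kayser model mentioned in the text.

```python
# Sketch: why per-window normalization exaggerates background clutter.
import numpy as np

fs = 1000                      # envelope sample rate (Hz), assumed
win = int(0.320 * fs)          # ~320 ms analysis window (Kayser model)

env = np.full(3 * win, 0.05)   # quiet background clutter throughout
env[:win] = 1.0                # loud foreground talker in window 1
# windows 2 and 3: a "gap" in the conversation -- only clutter remains

def per_window_norm(x, w):
    """Normalize each window to unit maximum (a stand-in for the
    models' normalization stage)."""
    out = x.copy()
    for start in range(0, len(x), w):
        seg = out[start:start + w]
        m = seg.max()
        if m > 0:
            seg /= m
    return out

norm = per_window_norm(env, win)
# After normalization, the faint clutter in the gap is as "salient"
# (value 1.0) as the loud talker -- its importance is exaggerated.
print(norm[0], norm[win])      # 1.0 1.0
```

The point is not the particular normalization rule, but that any scheme operating only within a short window has no way to know that the gap's loudest sound is still quiet in absolute terms.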

As Nuno has pointed out, saliency works on many time-scales. When a talker starts talking, they may be very salient; if they’ve been talking for tens of seconds, they may become easy to ignore.

Recognition & Saliency

Here at the workshop, we have considered using saliency maps as a first step towards speech recognition. But perhaps saliency isn’t ideal for machine recognition. Do we really want all the low-level features exaggerated? Perhaps saliency models could help find the words to be recognised, but then a different algorithm would take over.

By analogy, visual salience is great for getting a machine to foveate to the correct object, but then a different object-recognition process takes over. Perhaps in being critical of auditory saliency we’re over-estimating how useful visual saliency actually is.

Localization & Saliency

Current models don’t take binaural cues into account. Binaural cues help tell us the direction the sound is coming from. It seems likely that we would use these cues as part of saliency. If most of the sounds have been coming from the left, then a novel sound coming from the right is likely to stand out.

Measures of salience

In order to progress with auditory salience, it seems that we need to have some way of comparing models: we need experiments. In order to do experiments on auditory salience, we need to have some kind of a measure of auditory saliency.

In natural visual scenes, saliency is measured by eye movements. The most salient points are the ones that people are most likely to foveate to. Unfortunately for auditory research, we don’t move our ears, and eye movements do not reliably track sounds. We move our heads a little, but this doesn’t appear to be a very automatic reaction, and it wouldn’t provide as precise information as saccades do in the visual domain. Thus it seems that we need an alternative.

Detection was proposed as an alternative measure of salience. It certainly seems related: if a sound cannot be detected, it cannot be salient, and a salient sound is clearly detectable, with gradients in between. We have good measures and statistics related to detection – it’s not binary, detected or not. Obviously, when we discuss salience, we’re talking about the upper end of detectability. And unfortunately, it’s more difficult to measure contrasts between these higher levels of detectability.
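One standard way to put detectability on a graded scale, as the discussion above calls for, is the sensitivity index d′ from signal detection theory, computed from hit and false-alarm rates. The listener data below are invented for illustration.

```python
# Sketch: detection is graded, not binary. d' places the detectability
# of a candidate "salient" sound on a continuum.
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical listener data for two sounds embedded in clutter:
weak = d_prime(0.60, 0.40)    # barely detectable (d' ~ 0.5)
strong = d_prime(0.95, 0.05)  # near the "salient" end (d' ~ 3.3)
print(round(weak, 2), round(strong, 2))
```

As the text notes, the practical difficulty is that candidate salient sounds all sit at the high-d′ end, where differences between them are hard to resolve experimentally.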

Nuno pointed out an important distinction between salience and detectability. In visual search tasks, there are some “salient” differences (e.g., orientation) which pop out. Other features can only be found by a serial search. Thus, in a sense, both tasks are equally detectable. The crucial difference that makes one salient and the other not is that the salient feature allows parallel processing – the feature jumps out. The non-salient feature requires the serial search. These suggest different processes, and, what’s more, different processes that can be measured.

An auditory equivalent of this could perhaps be measured in terms of selective response times. There are some known differences in auditory response times (e.g., selective RTs to voices are significantly faster than to instrumental sounds matched in pitch and power). This could be extended to more cluttered auditory scenes.

Kailash pointed out that this is more of a challenge with auditory scenes: with vision, many objects can be presented simultaneously that interfere little at the level of the retina and V1. For audition, most natural sounds would mutually mask each other if presented simultaneously. This would make the interpretation of the experimental results more difficult. One workaround for this is to present sounds one after the other, hoping to see faster reaction times for salient sounds.

Nuno pointed out that it was very difficult to talk about saliency without including top-down influences and tasks. Many natural visual scenes (the ones we’re faced with on a daily basis) involve many, many salient objects. What we find salient depends on the context. But is salience not meant to be some kind of task-independent measure of the things that have the potential to attract attention? Or at least the stimulus-driven component of this? Perhaps this can only be separated out to the extent that bottom-up and top-down elements can be separated out in an interactive auditory system, where top-down influences may affect how low-level sounds are processed.

Saliency replacements?

Perhaps many of us are more interested in what auditory features we can attend to. Can we attend to location? To pitches? To spectral regions?

Perceptual features and salience

On a related theme, the features that we perceive surely play a part in saliency. In vision, our basic features include orientation detectors, for example, whose impact on saliency is that a line can pop out from other lines if it is at a different angle. In hearing, we don’t know what the basic features are.

Should we wait until we know more about auditory features before attempting saliency models?

Shih-Chii suggested that you need to choose the potential features, run with them, and see how it works, even if the features are wrong. Nuno agrees that this is what worked in vision.

In audition, we have not experimented with many different potential features. A popular starting-point is the pure tone. This has a lot of face value, since we know that the cochlea (approximately) separates sounds into constituent pure tones when the pure-tone frequencies are sufficiently distinct. However, psychophysical results suggest that we are surprisingly incapable of processing pure tones individually. For example, we have a lot of difficulty counting simultaneously presented pure tones, even when there are only two or three pure-tone components (Thurlow & Rawlings, 1959). This suggests that the basic auditory features may be formed from combinations of pure tones, and as such, pure tones are unlikely candidates to be the basic auditory features.

The current auditory saliency models use slightly more complex features, but still share the same working assumption that the basic features are local in frequency. This restriction is perhaps a holdover from the visual domain, in which objects tend to be spatially localised. However, it is not clear how to replace this assumption in the auditory domain.

One “feature” that seems to cause pop-out in audition is loudness. Unfortunately, the current auditory saliency models do a very good job of normalizing medium-term variations in loudness, meaning that they play down potential pop-out due to loudness. Perhaps a suitable empirical measure of salience would be one way to discover what the basic features of saliency are.

A basis in perception for saliency

At some level, saliency has a basis in perception. As such, future models of saliency may benefit from taking into account what is known about auditory perception. For example, the current models of auditory saliency are based on a linear frequency scale, rather than a log scale, which would better represent the human auditory system. The Kayser model uses a fixed range of scales for each feature. The sizes of these features were presumably selected to be sensible or useful values. However, because of the linear scale, the effective perceptual size of the features approximately doubles with each octave. Starting with a frequency scale that better represents perception may make it easier to calibrate future saliency models, and allow more meaningful comparisons with other perceptual results.
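The scale mismatch can be illustrated with a short calculation: a feature of fixed width in Hz (100 Hz here, an arbitrary choice) spans roughly half as many octaves each time its centre frequency doubles, so its effective perceptual size is not constant across the spectrum.

```python
# Sketch: fixed linear-frequency feature width vs. perceptual (log) width.
import math

bandwidth_hz = 100.0          # hypothetical fixed feature size in Hz
widths_in_octaves = {}
for fc in (250.0, 500.0, 1000.0, 2000.0):
    # width of the band [fc - bw/2, fc + bw/2] measured in octaves
    octaves = math.log2((fc + bandwidth_hz / 2) / (fc - bandwidth_hz / 2))
    widths_in_octaves[fc] = octaves
    print(f"{fc:6.0f} Hz centre -> {octaves:.3f} octaves wide")
# On a log (octave) axis the same feature would have constant width,
# matching the roughly logarithmic tonotopy of the auditory system.
```

Equivalently, a feature of fixed width on a log axis doubles its width in Hz with each octave, which is the effect noted in the text.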

A well-established perceptual result that could improve saliency models is the threshold of hearing, which varies with frequency. The current models treat all frequencies equally, in principle even outside the range of human hearing. Clearly, inaudible frequencies are unlikely to be salient. More generally, a first assumption would be that the salience of an object is related to its level above threshold. If this were not the case, it would be an interesting result that some frequencies disproportionately attracted attention.
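As a sketch of how a threshold-of-hearing correction might enter a model, the snippet below uses a common analytic approximation of the absolute threshold in quiet (due to Terhardt); the idea of weighting components by their level above threshold is the first assumption proposed above, not part of the current models.

```python
# Sketch: weighting components by audibility rather than raw level.
import math

def threshold_quiet_db(f_hz):
    """Approximate absolute hearing threshold in quiet (dB SPL),
    Terhardt's analytic formula."""
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def sensation_level(level_db_spl, f_hz):
    """Level above threshold; negative means inaudible."""
    return level_db_spl - threshold_quiet_db(f_hz)

# The same 20 dB SPL component is well above threshold at 1 kHz but
# inaudible at 30 Hz -- a model treating both equally gets this wrong.
print(sensation_level(20.0, 1000.0))   # positive: audible
print(sensation_level(20.0, 30.0))     # negative: inaudible
```

Any calibrated audiogram or equal-loudness contour could substitute for the analytic formula; the point is only that salience should track audibility, not raw signal level.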
