Attention Model

The divisive normalization model of attention (Reynolds and Heeger) is defined for a single set of image features. The goal of this project is to extend the model (and its relatives) in several directions: to real images, and to the auditory and audiovisual domains.
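As a point of reference, the core of the normalization model can be sketched in a few lines: an excitatory drive is multiplied by an attention field and divided by a spatially pooled suppressive drive. The sketch below is a 1-D illustration; all parameter values (sigma, pool widths, attention gain) are assumptions for demonstration, not values from the paper.

```python
import numpy as np

def gaussian(x, mu, width):
    """Unnormalized Gaussian bump over positions x."""
    return np.exp(-0.5 * ((x - mu) / width) ** 2)

def normalization_model(stim_drive, attn_field, pool_width=10.0, sigma=0.1):
    """R = (E * A) / (sigma + pool(E * A)): stimulus drive E multiplied by
    the attention field A, divided by a pooled suppressive drive."""
    excitatory = stim_drive * attn_field
    # Suppressive drive: Gaussian-weighted spatial pooling of the excitatory drive.
    kernel = gaussian(np.arange(-30, 31), 0.0, pool_width)
    kernel /= kernel.sum()
    suppressive = np.convolve(excitatory, kernel, mode="same")
    return excitatory / (sigma + suppressive)

x = np.arange(100)
stim = gaussian(x, 30, 3) + gaussian(x, 70, 3)   # two identical stimuli
attn = 1.0 + 2.0 * gaussian(x, 30, 8)            # attend to the one at x = 30
resp = normalization_model(stim, attn)
print(resp[30] > resp[70])  # True: the attended stimulus wins the competition
```

Attention multiplicatively boosts the drive at the attended location, and normalization turns that boost into a competitive advantage over the unattended stimulus.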

We start with a front-end based on bottom-up saliency features (Itti, Koch, and Niebur) and test its performance on real images. Since this is not meant to be a purely bottom-up saliency task, we cannot rely on simple eye-tracking data to evaluate performance; there has to be a “cognitive” component defined by a task. We propose to collect or download data containing some target. The objective is to translate the target into a set of primitives, use the model to “enhance” the representation of the target, and then test for its presence. Note: in this project the saliency front-end is merely a front-end, and it may or may not need to be exploited beyond extracting features.
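A minimal sketch of this kind of front-end, assuming a single intensity channel: center-surround (difference-of-blurs) conspicuity maps per channel, combined with top-down gains derived from the target's primitives. The blur scales, the crude box blur, and the gain values are illustrative assumptions, not the actual Itti-Koch-Niebur pipeline.

```python
import numpy as np

def blur(img, n):
    """Crude 5-point box blur applied n times (stand-in for a Gaussian;
    wraps at the image edges)."""
    for _ in range(n):
        img = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
               np.roll(img, 1, 1) + np.roll(img, -1, 1) + img) / 5.0
    return img

def center_surround(channel, fine=2, coarse=8):
    """Rectified difference between a fine and a coarse blur (DoG-like)."""
    return np.maximum(blur(channel, fine) - blur(channel, coarse), 0.0)

def saliency(channels, gains):
    """Weighted sum of per-channel conspicuity maps; `gains` is the
    top-down signal derived from the target's primitives."""
    return sum(g * center_surround(c) for c, g in zip(channels, gains))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img[20:24, 20:24] += 2.0                   # bright target patch in the noise
smap = saliency([img], gains=[1.0])
peak = np.unravel_index(np.argmax(smap), smap.shape)
# The saliency peak lies inside the bright patch.
```

With more channels (color, orientation), the same `gains` vector is where the task-derived target primitives would enter.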

Expected Outcome: Demonstrate performance either in real time on a robotic platform or offline on a collected dataset.


  1. Francisco Barranco - Visual lead
  2. 'trevor'
  3. Kailash Patil - Auditory co-lead
  4. 'merve' - Auditory co-lead
  5. Fabio Tatti
  6. Shih-Chii Liu - AV Mentor
  7. Barbara Shinn-Cunningham (Boston) - Auditory Mentor
  8. 'Vasconcelos' (UCSD) - Visual Mentor
  9. 'Fred' () - Visual Mentor
  10. 'Jude Mitchel' (Salk) - Visual Mentor

The Visual Plan

The Audio Plan

The Audio-Visual Plan

The goal of this project is to explore the use of attention in a multi-modal fusion task. We learn the identity of an audiovisual object through the temporal correlation of auditory and visual features. The temporally correlated features are first learned by a network; features that are useful for identifying the object acquire higher gains through this learning. Later, during testing in a scene with distractors, the attentional signal is used, for example, to track the object in auditory space when the visual signature is present (and vice versa), or to ignore distractors.

Subproject 1: generation of an object simulation environment by programming a visual circle that moves in a specific direction, accompanied by a sound that could, for example, change in pitch as the object moves up and down. How can we program this environment, and how can we synchronize the start of the stimulus presentation with the start of recording from vision and audition?
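One way to guarantee synchrony is to derive both streams from a single shared clock rather than launching two independent processes. A sketch under assumed parameters (frame rate, sample rate, pitch mapping are all illustrative choices):

```python
import numpy as np

FPS = 30            # video frame rate (assumed)
SR = 16000          # audio sample rate (assumed)
DUR = 2.0           # stimulus duration in seconds

t_frames = np.arange(int(FPS * DUR)) / FPS
t_audio = np.arange(int(SR * DUR)) / SR

# Vertical position of the circle in [0, 1]: a slow sinusoid.
y_frames = 0.5 + 0.5 * np.sin(2 * np.pi * 0.5 * t_frames)

# Pitch tracks height: 300 Hz at the bottom, 900 Hz at the top (assumed mapping).
y_audio = 0.5 + 0.5 * np.sin(2 * np.pi * 0.5 * t_audio)
freq = 300.0 + 600.0 * y_audio
# Integrate the instantaneous frequency to get a smooth, click-free phase.
phase = 2 * np.pi * np.cumsum(freq) / SR
sound = np.sin(phase)

# Both streams start at t = 0, so frame k aligns with audio
# sample round(k * SR / FPS); no separate trigger is needed.
print(len(y_frames), len(sound))  # 60 32000
```

For recording from real vision and audition sensors, the same idea applies: timestamp both streams against one master clock, or emit a common start trigger.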

Subproject 2: use rate-coded versions of the spike outputs of the retina and cochlea. First look for temporal correlations between visual and auditory features, pass them through a network that classifies the object, and then use similar features for a saliency map. In a test environment with distractors, take the attentional output and use it to bias the outputs of the features specific to the object.
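The correlation-then-bias step might look like the sketch below: cross-modal correlations between rate traces are turned into per-feature gains, and at test time those gains multiply the feature outputs so the object's features dominate. The feature layout (one object-driven and one distractor feature per modality) and the gain rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
common = rng.random(T)                    # shared object dynamics

# Two visual and two auditory rate signals; feature 0 in each modality
# is driven by the object, feature 1 is an independent distractor.
vis = np.stack([common + 0.1 * rng.standard_normal(T), rng.random(T)])
aud = np.stack([common + 0.1 * rng.standard_normal(T), rng.random(T)])

def corrcoef(a, b):
    """Pearson correlation between two rate traces."""
    a = a - a.mean()
    b = b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Learning: cross-modal correlation matrix -> per-feature gains.
C = np.array([[corrcoef(v, a) for a in aud] for v in vis])
gains_vis = np.maximum(C.max(axis=1), 0.0)   # how object-like each visual feature is
gains_aud = np.maximum(C.max(axis=0), 0.0)

# Test time: the attentional signal biases feature outputs by the learned
# gains, enhancing the object's features and suppressing distractors.
biased_vis = gains_vis[:, None] * vis
print(gains_vis[0] > gains_vis[1])  # True: the object feature gets the higher gain
```

In the full system the gains would come from the trained classifier network rather than raw correlations, but the biasing operation is the same multiplicative step.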

Question: do we explore the use of space as part of the feature map?

Subproject 3: use visual features from the retina output and design auditory features from the cochlea output. Perform temporal cross-correlation of the features, carry out a similar object-recognition process, and then implement the attentional feedback using spiking networks or rate neurons.
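With real sensors the two feature streams will not be perfectly aligned, so the cross-correlation step should also recover the relative lag. A sketch on synthetic traces (the lag, noise level, and signal model are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
T, true_lag = 400, 7
drive = rng.random(T + true_lag)
vis = drive[true_lag:] + 0.05 * rng.standard_normal(T)  # visual leads
aud = drive[:T] + 0.05 * rng.standard_normal(T)         # audio lags by 7 samples

def best_lag(a, v):
    """Shift k maximizing sum_n a[n + k] * v[n], found via the full
    cross-correlation of the mean-removed traces."""
    a = a - a.mean()
    v = v - v.mean()
    c = np.correlate(a, v, mode="full")
    return int(np.argmax(c)) - (len(v) - 1)

print(best_lag(aud, vis))  # 7: the synthetic lag is recovered
```

Once the streams are aligned at the recovered lag, the recognition and attentional-feedback stages can proceed as in subproject 2, with either rate neurons or spiking networks implementing the gain modulation.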