Attention-Driven Scene Analysis

Members: Adam McLeod, Ching Teo, Daniel B. Fasnacht, Francisco Barranco, Janelle Szary, Kailash Patil, Malcolm Slaney, Mounya Elhilali, Michael Pfeiffer, Shih-Chii Liu, Barbara Shinn-Cunningham, Tomas Figliolia, Timmer Horiuchi, Tobi Delbruck, Troy Lau, Yezhou Yang

- Organized by Julien Martel & Mounya Elhilali with Malcolm Slaney


  1. Julien Martel (McGill University) – Not able to come
  2. Mounya Elhilali (Johns Hopkins) – Week 1
  3. Malcolm Slaney (Yahoo! Inc. and Stanford Univ.) – 28Jun/16Jul
  4. 'Fred Hamker' (Chemnitz University of Technology) – 8Jul/15Jul
  5. Jude Mitchel (Salk Institute) – 28Jun/2Jul
  6. 'Ozlem Kalinli' (Sony) – Week 2
  7. Kang Wang (Ohio State) – 6Jul/9Jul
  8. 'Nuno Vasconcelos' (UCSD) – 11Jul/15Jul
  9. Barbara Shinn-Cunningham (Boston University) – Week 2


A central problem for modern information processing systems is that they can gather more information from the environment than they can process in real time. Sensory systems expanded across the body surface and inner organs, but the central processing unit grew much more slowly, so the gathered signals could not all be processed in parallel. Specializations such as the human eye, with its central high-resolution fovea and its ability to rotate within the orbit, and the hand, with its high density of touch receptors, accompanied by an over-representation of these parts in the central processing unit (the brain), partially solved this problem.

However, such hard-wired specializations are not enough to relieve the overload of the central processing unit (e.g., the brain). Fortunately, nervous systems have adapted to solve this problem quickly, efficiently and thoroughly. Essentially all animals, including insects, have developed mechanisms of selective attention. Attention is a primary cognitive function, and arguably one of the most important aspects of understanding the world around us. Without limiting, in a smart and situation-dependent manner, the amount of information that is processed in detail, higher-level cognitive processing would be impossible. Auditory attention is used to “hear out” the desired signal, yet attention can shift toward sounds that are particularly salient. Visual attention is used to explore our world, helping the animal direct its gaze to the portions of the visual field needed to understand the scene. Importantly, the inputs from all these modalities have to be integrated and taken into account to make a “good” behavioral decision.

The Attention workgroup aims to explore attention-driven scene analysis. Auditory and visual scene analyses are both important applications and areas of research. We want to understand how to select and attend to the most relevant portions of a scene. We believe that this problem is one of the most important aspects of cognition and perception. Potential research directions include saliency (bottom-up), selection (top-down) and the role of synchrony (implementation). We envision three kinds of participants: those with general expertise in the theory, neurophysiology and psychophysics of attention; auditory scene analysis modelers; and computer-vision researchers interested in attention and selection.



  • 2011/att11/SaliencyMeasurent - Discussion of benchmarks/tests to drive future auditory saliency work
  • 2011/att11/AuditorySaliency - Discussion of current auditory saliency models, possible review paper
  • Do audio and visual attention work the same way (except for different input)?


Discussed Projects

Original Subtasks (all incorporated into projects above)

Task 1: Extend divisive normalization model to work on real images

Description: The divisive normalization model (Reynolds and Heeger) does not really operate on a rich feature-based representation. The goal is to extend this model with a front-end based on bottom-up saliency features (Itti-Koch-Niebur) and test its performance on real images. Since this is not supposed to be simply a bottom-up saliency task, we cannot rely on simple eye-tracking data to test performance. There has to be a “cognitive” component where a task is defined. I propose to collect/download data with some target. The objective is to translate the target into a set of primitives, then use the model to “enhance” the representation of the target and test for its presence. Note: in this project, the saliency front-end is merely a front-end, and may or may not need to be exploited beyond extracting features!
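To make the starting point concrete, here is a minimal sketch of the divisive normalization computation on a one-dimensional line of positions. It assumes the commonly quoted form of the model, in which the stimulus drive E is multiplied by an attention field A and then divided by a spatially pooled copy of that product plus a constant. The 1-D layout, the Gaussian pooling kernel, the parameter values and all function names are illustrative assumptions, not part of the workgroup plan, and the feature dimension that Task 1 wants to add is left out.

```python
import math

def gaussian_kernel(width, spread):
    """Return a normalized 1-D Gaussian kernel (width should be odd)."""
    half = width // 2
    taps = [math.exp(-0.5 * (i / spread) ** 2) for i in range(-half, half + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def pool(signal, kernel):
    """Convolve signal with kernel, treating out-of-range samples as zero."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def normalized_response(drive, attention, kernel, semisaturation=0.1):
    """R = (A * E) / (pooled(A * E) + semisaturation), per position."""
    excit = [a * e for a, e in zip(attention, drive)]
    suppress = pool(excit, kernel)
    return [x / (s + semisaturation) for x, s in zip(excit, suppress)]

# Two equal-strength stimuli at positions 5 and 14 on a 20-sample line.
drive = [0.0] * 20
drive[5] = drive[14] = 1.0
kernel = gaussian_kernel(width=21, spread=6.0)

# Neutral condition: flat attention field.
neutral = normalized_response(drive, [1.0] * 20, kernel)

# Attended condition: attention gain peaks at 2 on the stimulus at position 5.
attention = [1.0 + math.exp(-0.5 * ((i - 5) / 2.0) ** 2) for i in range(20)]
attended = normalized_response(drive, attention, kernel)
```

In this toy setting the response at the attended position rises relative to the neutral condition, while the response to the unattended stimulus falls, because the attended stimulus contributes more to the shared suppressive pool.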

Expected Outcome: Show performance either ‘real-time’ using a robotic platform or ‘offline’ on a collection of data.

Lead people/person: Jude Mitchel, Tomas Figliolia, 'fbarraco', 'austin.meyers'

Task 2: Extend divisive normalization model to auditory modality

Description: Similar to Task 1 but exclusively in the auditory modality. Here, we can try the Kayser/UCLA front-ends along with other features. Here again, the goal is to define a target and test for its presence. One of the main components of the Reynolds and Heeger model is normalization. In the auditory modality it is not clear how this will work. However, one could use the intensity of the sound as stimulus contrast and other properties such as pitch as the equivalent of visual features such as color or orientation.

Expected Outcome: Either real-time or offline.

Lead people/person: Kailash Patil

Task 3: Similarities and Differences among models of visual attention

Description: The goal is to compare the model of Reynolds and Heeger (or its extended version per Task 1) with Hamker’s model or Nuno’s model. This may be more of a literature review to explain to us [workgroup participants] the similarities/differences between the models, phenomenological comparisons, and applicability/tests on real-world images.

Expected Outcome: May or may not have a demo, but would be nice to have an overview of what processes have been modeled in attention (visual) + advantages/disadvantages for real applications.

Lead people/person: Kailash Patil

Task 4: Do visual and auditory attention operate the same way?

Description: A strictly ‘scientific’ question with no cool demos per se. The purpose is to ask whether results simulated using the Reynolds and Heeger model can explain changes in receptive fields in auditory cortex under the attentional spotlight. We can get the real neural data that were used for the following paper: Atiani S, Elhilali M, David S, Fritz JB, Shamma SA (2009) “Task difficulty and performance induce different adaptive patterns in gain and shape of cortical receptive fields”, Neuron, 61(3), pp. 467-480. This paper also asked how to reconcile gain and contrast changes in cortical responses.

The question of response gain vs. contrast gain is an interesting one. For example, one may model both and ask which one fits the data better. Julio and colleagues have written a paper on this (Khayat et al. 2010) in which they show that contrast gain seems to be the best model.
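The distinction between the two kinds of gain can be illustrated with a small sketch. It assumes a Naka-Rushton contrast response function, which is a common descriptive choice and not something specified on this page; the parameter values and function names are hypothetical. Response gain scales the whole curve, while contrast gain behaves like a multiplication of the stimulus contrast, which is equivalent to lowering the semi-saturation contrast c50.

```python
def naka_rushton(contrast, r_max=1.0, c50=0.3, n=2.0):
    """Contrast response function: r_max * c^n / (c^n + c50^n)."""
    cn = contrast ** n
    return r_max * cn / (cn + c50 ** n)

def response_gain(contrast, gain=1.5):
    """Attention scales the whole curve by a constant factor."""
    return naka_rushton(contrast, r_max=gain)

def contrast_gain(contrast, gain=1.5, c50=0.3):
    """Attention acts like a higher effective contrast (c50 divided by gain)."""
    return naka_rushton(contrast, c50=c50 / gain)

# Ratio of attended to unattended response at low, medium and high contrast.
for c in (0.05, 0.3, 1.0):
    base = naka_rushton(c)
    print(f"c={c:.2f}  "
          f"response-gain ratio={response_gain(c) / base:.2f}  "
          f"contrast-gain ratio={contrast_gain(c) / base:.2f}")
```

Under response gain the attended/unattended ratio is the same at every contrast, whereas under contrast gain the ratio is largest at low contrast and approaches 1 as the response saturates. That difference in shape is what a model comparison on real data would have to detect.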

Expected Outcome: Show parallels and differences between auditory attention (effect on auditory cortex receptive fields) and visual attention (effect on cortical responses). This project will involve careful literature review and input/guidance from faculty.

Lead people/person: Yiannis Aloimonos, 'fbarraco'

Task 5: Spike-based attentional processing

Description: The project will use Tobi's retina and Nengo to try to implement simple visual models. Initially, the plan is to have a set of Gabor filters in spikes, using Nengo, that works with the retina (low-hanging fruit). Next, the plan is to implement a simple HMAX-like model and then move on to Nuno's model for saliency.
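As a reference point for the first step, the following is a minimal frame-based sketch of a Gabor filter bank in plain Python. It is not spiking, it does not use the Nengo API, and it does not read from the silicon retina; it only shows the kernel that a spiking implementation would need to approximate. The kernel size, wavelength, envelope width and function names are illustrative assumptions.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """Return a size x size Gabor kernel as a list of rows (size should be odd).

    theta (radians) is the direction along which the cosine carrier varies.
    """
    half = size // 2
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so that xr runs across the stripes.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        rows.append(row)
    return rows

def filter_response(patch, kernel):
    """Dot product of an image patch with a kernel of the same shape."""
    return sum(p * k
               for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))

# Bank of four orientations: 0, 45, 90 and 135 degrees.
bank = [gabor_kernel(9, wavelength=4.0, theta=i * math.pi / 4, sigma=2.0)
        for i in range(4)]

# Test patch: a grating that varies along x, matched to the theta=0 filter.
patch = [[math.cos(2.0 * math.pi * x / 4.0) for x in range(-4, 5)]
         for _ in range(9)]
responses = [filter_response(patch, k) for k in bank]
```

The filter whose carrier direction matches the grating gives by far the largest response, which is the orientation selectivity that the spike-based version would be expected to reproduce.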

Expected Outcome: Real-time demo with the silicon retina

Lead people/person: 'siddharth', Shih-Chii Liu, Sushant Rao

Task 6: Discussion of saliency benchmarks for auditory modality

Description: This may or may not evolve into a full-fledged model, but should start as a discussion among faculty and interested participants as to what the equivalent of eye-tracking data is for auditory saliency.

Expected Outcome: May involve some data collection or recommendations for data collection

Lead people/person: 'Merve Kaya'

Task 7: Multi-object attention

Description: This could be explored depending on interest from participants/faculty, but would basically involve exploring multifocal attention and how splitting attention might degrade performance relative to focused attention. How would one approach modeling this problem? Some psychophysical data could be collected in Telluride (both visual and auditory).

Expected Outcome: Could show some results from psychophysical tests as well as modeling approaches to multi-target attention

Lead people/person: Shih-Chii Liu, Jonathan Tapson

Task 8: Saliency and brain waves

Description: The premise of this project is that neurons and groups of neurons in the brain behave as oscillators with peak amplitudes in different frequency bands. We will test the theory that attention modulates the synchrony between the activities of different groups of neurons, as well as local field potentials, in a frequency-dependent manner. This hypothesis has been supported by neurophysiological data, but has yet to be explored in a detailed model of scene analysis. A possible follow-up is an EEG experiment that presents sounds and images belonging to the same or different objects and analyzes synchrony between visual and auditory brain areas.

We could possibly also use the input from the artificial retina and convert it into oscillations. Attention could synchronize the areas of the retina that are spatially attended. If we have multiple layers of detectors tuned for different features, attention could also synchronize them when a certain feature is attended. One question here is which frequency band would work best for the oscillations. Also, oscillations between auditory and visual detectors could produce the object percept (see Singer’s work).
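One simple way to quantify the synchrony discussed here is the phase-locking value between two oscillating signals. The sketch below assumes that instantaneous phases are already available (for real recordings they would first have to be extracted, for example with a band-pass filter and a Hilbert transform, which is not shown). The 40 Hz rhythm, the jitter level and the variable names are illustrative assumptions, not a proposal from the workgroup.

```python
import cmath
import math
import random

def phase_locking_value(phases_a, phases_b):
    """|mean(exp(i * (phi_a - phi_b)))|: 1 means locked, near 0 means unrelated."""
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)

rng = random.Random(0)  # fixed seed so the example is repeatable
n = 2000
# Phase of a 40 Hz rhythm sampled at 1 kHz.
base = [2.0 * math.pi * 40.0 * t / 1000.0 for t in range(n)]

# "Attended" pair: same rhythm, fixed lag, small phase jitter.
locked = [p + 0.5 + rng.gauss(0.0, 0.2) for p in base]
# "Unattended" pair: phase drawn independently on every sample.
unlocked = [rng.uniform(0.0, 2.0 * math.pi) for _ in base]

plv_locked = phase_locking_value(base, locked)
plv_unlocked = phase_locking_value(base, unlocked)
```

A constant lag between the two signals does not reduce the value; only variability of the phase difference does. A synchrony-based saliency model could use a measure of this kind, computed per frequency band, in place of the firing-rate gain used by classic models.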

Expected Outcome: Model comparison using ‘classic’ models of saliency vs. synchrony-based saliency.

Lead people/person: 'Merve Kaya', 'trevor', Shih-Chii Liu

Task 9: Attention during smooth pursuit eye movements

Description: Space-centered vs. retina-centered attention. In this project, one can use eye movements to dissociate retina-centered and space-centered targets. A retina-centered saliency map cannot encode target position in space: during pursuit, the retinal coordinates of an object may not change if the object moves with the pursuit target, even though the object does change position in space. So far, models of attention are retina-centered. Any robotic device that uses retina-centered maps cannot perform well in dynamic scenes unless it implements some kind of re-mapping mechanism.
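The dissociation described above can be shown with a few lines of code. This sketch assumes the simplest possible re-mapping, in which a space-centered position is the retinal position plus the current eye position, with both expressed in the same small-angle units (for example degrees). The function name and the numbers are hypothetical.

```python
def to_space(retinal_pos, eye_pos):
    """Space-centered position = retinal position + eye (gaze) position."""
    return (retinal_pos[0] + eye_pos[0], retinal_pos[1] + eye_pos[1])

# An object that moves together with the pursuit target: the eye follows it,
# so its retinal position stays fixed while its position in space changes.
eye_path = [(float(t), 0.0) for t in range(5)]   # gaze sweeps to the right
retinal_path = [(2.0, 1.0)] * 5                  # constant on the retina

space_path = [to_space(r, e) for r, e in zip(retinal_path, eye_path)]

for r, s in zip(retinal_path, space_path):
    print(f"retina={r}  space={s}")
```

A retina-centered map sees the same coordinates on every time step, so only a map that also receives the eye-position signal can track where the object is in the world.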

Expected Outcome: Proposals and demo of new ways to encode attention in smooth pursuit

Lead people/person: Tomas Figliolia, 'fbarraco', Jonathan Tapson, Yezhou Yang

Task 10: Compare Auditory Saliency Models

Description: We'll have at least two different auditory saliency models in Telluride (Ozlem and Kayser). We should compare and contrast them. Add in any new models we think of.

Expected Outcome: Either a winning model or, more likely, a review article that describes how the models relate to each other (and compares them to visual models).

Lead people/person: 'Merve Kaya', 'trevor'

Task 11: Build Eye Tracker

Description: There is open-source eye-gaze software. This might be a good first exercise for an engineering student (Malcolm will bring a camera). It might not be good enough for serious eye-tracking for attention, but it might be good enough for auditory attention studies (as a source of pointing information).

Expected Outcome: simple eye-tracking solution for attention projects.

Task 12: Multimodal attention using silicon sensors

Description: Explore audio-visual integration based on inputs from the cochlear and retinal chips, and the possible role of attention. Possibly use dissociation of auditory and visual cues that belong to the same or different objects.

Expected Outcome: Real-time demo of multimodal integration and attention

Lead people/person: 'trevor', 'Merve Kaya', Janelle Szary, Shih-Chii Liu