Proto-Object Based Saliency Map

The saliency map (Koch & Ullman 1985, Niebur & Koch 1996) has been very successful for understanding a large variety of attentional phenomena. An inherent limitation is, however, that it is based on purely spatially defined regions. It is therefore difficult to explain attention to objects (e.g. Egly et al 1994).

The aim of this project is to develop a saliency map whose fundamental elements are contextually related areas of the visual field, rather than pixels. Following Rensink et al (1997) and others, we call these areas proto-objects. Rensink et al (ibid) define a proto-object as "a volatile unit of visual information that can be bound into a coherent and stable object when accessed by focused attention." Note that while pixel-based saliency maps detect the odd feature out, proto-object based maps also provide the shape and extent of the region to which attention is applied.

Border ownership and grouping

The proto-object based saliency map uses the idea of grouping and border ownership cells, as discussed by Craft et al. (2007), to deal with occlusion and overlap between adjacent objects. This is summarized in Figure 1, which shows a border ownership network for two overlapping rectangles. Briefly, two border ownership (B) cells that encode opposite border ownership directions share the same receptive field (i.e., they have identical response properties in all feature dimensions except for their border ownership selectivity). When an object edge falls into their receptive field, both B cells are excited. The two cells inhibit each other, and each receives excitation from its respective grouping cells. The grouping cells receive excitatory input from all B cells within their receptive fields and, in turn, provide excitation to all B cells within this field. Consequently, of two B cells sharing the same receptive field, the one with stronger grouping excitation obtains ownership of that boundary.
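The competition between two B cells can be illustrated with a toy rate model. This is a minimal sketch and not the equations of Craft et al. (2007): the function name compete, the rectified-linear update, and all parameter values are assumptions chosen only for illustration.

```python
def compete(edge_drive, group_left, group_right,
            inhibition=0.8, steps=500, rate=0.1):
    """Relax two mutually inhibitory B cells that share one receptive field.

    Hypothetical toy model: each cell is driven by the same edge input plus
    feedback from its own grouping cell, and is inhibited by its rival.
    """
    b_left = b_right = 0.0
    for _ in range(steps):
        # Rectified input: edge drive + grouping feedback - rival inhibition
        target_left = max(0.0, edge_drive + group_left - inhibition * b_right)
        target_right = max(0.0, edge_drive + group_right - inhibition * b_left)
        # Move each activity a small step towards its current target
        b_left += rate * (target_left - b_left)
        b_right += rate * (target_right - b_right)
    return b_left, b_right


# Same edge input for both cells, but stronger grouping feedback on the left
b_left, b_right = compete(edge_drive=1.0, group_left=0.6, group_right=0.2)
owner = "left" if b_left > b_right else "right"
print(owner, round(b_left, 3), round(b_right, 3))
```

With these assumed parameters, the cell receiving the stronger grouping feedback stays active while its rival is suppressed, which is the sense in which it obtains ownership of the boundary. With equal grouping feedback the two cells settle at equal activity and the edge alone does not decide ownership.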

In addition to providing the bias to border ownership cells that generates their specific selectivity, grouping cells also provide a handle to a given object for object-based attention. Thus, when attention is directed to a perceptual object, top-down activation is directed towards the grouping cells affiliated with this object (Craft et al, 2007). Note that the grouping cells do not contain the complete representation of the object; it is the border ownership cells associated with the object that define its boundaries.

Figure 1: "Model architecture. A: network overview, showing border-ownership selection for a stimulus of two overlapping rectangles (bottom). Receptive fields of B cells are shown as ellipses, where attached arrows indicate their preferred side of figure. B cells with opposite arrows compete, and this competition is decided by grouping cell input (receptive fields of active cells are shown in green and red; receptive fields of suppressed cells shown in gray)" Craft et al. (2007)

The model

The saliency map makes use of the above border ownership model to define proto-object handles and boundaries. The model accepts any input image (including natural scenes) on which to calculate saliency. The image is processed via three different channels: intensity, red-green and blue-yellow. Note that, unlike the Itti et al (1998) model, no separate orientation channel is introduced at this first processing level. Instead, orientation information is extracted (in all three channels) at the edge-extraction stage (see below).
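The text does not give formulas for the three channels. As a rough illustration only, the sketch below assumes one common simple opponency scheme (mean intensity, red minus green, blue minus the red-green average). The function name opponent_channels and these definitions are assumptions, not the model's actual implementation.

```python
def opponent_channels(r, g, b):
    """Split one RGB pixel (floats in [0, 1]) into three assumed channels."""
    intensity = (r + g + b) / 3.0        # achromatic channel
    red_green = r - g                    # positive for red, negative for green
    blue_yellow = b - (r + g) / 2.0      # positive for blue, negative for yellow
    return intensity, red_green, blue_yellow


def split_image(rgb_image):
    """Apply the split to an image given as rows of (r, g, b) tuples."""
    channels = ([], [], [])
    for row in rgb_image:
        converted = [opponent_channels(*pixel) for pixel in row]
        for index, channel in enumerate(channels):
            channel.append([values[index] for values in converted])
    return channels


# Tiny 1x2 example image: one pure red pixel and one mid-grey pixel
intensity, red_green, blue_yellow = split_image([[(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]])
```

Under this assumed scheme a grey pixel produces no response in either colour channel, so the colour channels carry only chromatic contrast.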

To ensure scale invariance, each channel separates its input image into a pyramid, downsampling the image by a factor of 2 at each layer. A center-surround operation is then performed on each layer using a Difference of Gaussians filter, generating a Laplacian pyramid. Next, orientation-specific filters extract 0 and 90 degree edges from the center-surround maps (in later versions, additional orientations will be added). Activity in the edge maps and center-surround maps from each channel is then summed. The final center-surround map is used to calculate the grouping pyramid. The border ownership signals are then calculated using the grouping and edge maps.
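The first two steps of this pipeline, the pyramid and the center-surround operation, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the project's code: the kernel widths (sigma 1 and 2), the blur before subsampling, the edge-replication border handling, and all function names are hypothetical choices. Edge extraction, grouping and border ownership are not covered.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur; pixels outside the image replicate the border."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    offsets = range(-radius, radius + 1)
    height, width = len(image), len(image[0])

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    # Horizontal pass, then vertical pass on the result
    rows = [[sum(k * row[clamp(x + d, width)] for d, k in zip(offsets, kernel))
             for x in range(width)] for row in image]
    return [[sum(k * rows[clamp(y + d, height)][x] for d, k in zip(offsets, kernel))
             for x in range(width)] for y in range(height)]


def build_pyramid(image, levels):
    """Each level is the previous one blurred and subsampled by a factor of 2."""
    pyramid = [image]
    for _ in range(levels - 1):
        smooth = blur(pyramid[-1], 1.0)
        pyramid.append([row[::2] for row in smooth[::2]])
    return pyramid


def center_surround(image, sigma_center=1.0, sigma_surround=2.0):
    """Difference of Gaussians: narrow (center) blur minus wide (surround) blur."""
    center = blur(image, sigma_center)
    surround = blur(image, sigma_surround)
    return [[c - s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(center, surround)]


# 16x16 test image: a bright 6x6 square on a dark background
image = [[1.0 if 5 <= x < 11 and 5 <= y < 11 else 0.0 for x in range(16)]
         for y in range(16)]
pyramid = build_pyramid(image, levels=3)
cs_pyramid = [center_surround(level) for level in pyramid]
```

Because both kernels are normalized, a uniform region gives a center-surround response of essentially zero, while the interior of a small bright blob gives a positive response. Applying the same filter at every pyramid level is what lets a fixed-size kernel respond to structure at several spatial scales.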

Figure 2: The proto-object saliency model. The model was inspired by that of Itti et al. (1998); however, certain changes were made to accommodate the grouping mechanism. The most important of these changes is that the model has only two distinct channel types, colour (red-green and blue-yellow) and intensity. The orientation channel is now implicit in the edge map and border ownership selectivity information. Secondly, in the original saliency model it was possible to collapse the final saliency map into a 2D image. In our model, the final saliency is obtained using the grouping cells as object handles. Because there can be multiple handles at the same location coding for proto-objects at different spatial scales, the final saliency map is a pyramid. The most salient proto-object is obtained using an argmax operation over the pyramid. Keeping the saliency map in pyramid form makes it simple to include top-down modulation of attention to scale by weighting the different levels of the map. Segmentation of the objects is performed by assuming that all pixels with contiguous grouping cells and agreeing border ownership vectors belong to the same proto-object. The dashed line between the border ownership pyramid and the proto-object segmentation denotes that this component of the algorithm is incomplete.
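The argmax over the pyramid and the top-down weighting of levels described in the Figure 2 caption can be sketched as follows. The function name most_salient, the per-level scalar weights, and the toy saliency values are assumptions made for illustration; the caption does not specify how the weighting is implemented.

```python
def most_salient(saliency_pyramid, scale_weights=None):
    """Return (score, level, row, column) of the strongest grouping response.

    scale_weights holds one assumed top-down gain per pyramid level;
    by default all levels are weighted equally.
    """
    if scale_weights is None:
        scale_weights = [1.0] * len(saliency_pyramid)
    best = None
    for level, (grid, weight) in enumerate(zip(saliency_pyramid, scale_weights)):
        for y, row in enumerate(grid):
            for x, value in enumerate(row):
                score = weight * value
                if best is None or score > best[0]:
                    best = (score, level, y, x)
    return best


# Toy two-level pyramid: a fine 4x4 map and a coarse 2x2 map
toy_pyramid = [
    [[0.1, 0.2, 0.1, 0.0],
     [0.0, 0.3, 0.9, 0.1],
     [0.1, 0.2, 0.4, 0.0],
     [0.0, 0.1, 0.0, 0.0]],
    [[0.2, 0.7],
     [0.1, 0.3]],
]

unbiased = most_salient(toy_pyramid)
coarse_biased = most_salient(toy_pyramid, scale_weights=[0.5, 1.0])
```

Without weighting, the winner in this toy example lies on the fine level; halving the gain of the fine level moves the winner to the coarse level. This illustrates why keeping saliency as a pyramid, rather than collapsing it to a single 2D map, makes attention to scale easy to modulate.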


Figure 3 below shows an input image of hot air balloons with the boundary of the proto-object describing the elephant shown in green.

Figure 3: Proto-object segmentation of the elephant balloon.

The grouping map and associated boundary vectors are shown in the map below. Arrows indicate the direction of border ownership at a given location, with their length indicating the strength of border ownership. The green cross indicates the elephant.

Figure 4: Grouping map and associated boundary vectors. Note the change in direction between vectors at the boundary between two objects.

Results of the algorithm applied to the balloon image are shown below, ordered from most to least salient object. The grouping of multiple balloons into a single proto-object in the first panel (highest saliency) is inconsistent with perception, which would separate them. This can likely be corrected by incorporating border ownership signals into the image segmentation algorithm (dashed line in Figure 2).

Figure 5: The nine most salient objects in the balloon image.


Ernst Niebur, 'arussell', Ralph Etienne-Cummings


Craft E, Schuetze H, Niebur E, von der Heydt R (2007) A neural model of figure-ground organization. J Neurophysiol 97: 4310-4326

Itti, L., Koch, C. and Niebur, E. (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. PAMI 20(11), 1254-1259

Koch, C. and Ullman, S. (1985) Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology. 4, 219-227

Niebur, E. and Koch, C. (1996) Control of selective visual attention: modeling the 'where' pathway. Neural Information Processing Systems 8, 802-808

Rensink, R. A., O'Regan, J. K. and Clark, J. J. (1997) To see or not to see: the need for attention to perceive changes in scenes. Psychological Science 8(5), 368-373