Attention mechanisms

As attention mechanisms we use two concepts: the torque operator and object filters. Both are designed to localize regions of interest. Fixation points are selected within these regions and passed to the segmentation algorithm, which separates foreground from background. The torque operator locates image regions containing closed boundaries, and the centers of these regions are selected as candidate fixation points. The object filters, in essence, are pixel-based classifiers trained to respond probabilistically to certain object categories. High filter responses mark regions likely to contain an object, and the centers of these regions are selected as additional fixation points.
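The second step of both mechanisms, turning a per-pixel response map into fixation points, can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name and the fixed threshold are hypothetical, and the response map stands in for either a torque map or an object-filter probability map.

```python
import numpy as np
from scipy import ndimage

def fixation_points(response_map, threshold=0.5):
    """Pick one fixation point per high-response region.

    `response_map` is a 2-D array of per-pixel scores (e.g. an
    object-filter probability map); `threshold` is a hypothetical
    cutoff for what counts as a region of interest.
    """
    # Binarize the response map and find its connected regions.
    mask = response_map > threshold
    labels, n = ndimage.label(mask)
    # The center of mass of each region serves as a fixation point.
    centers = ndimage.center_of_mass(mask, labels, range(1, n + 1))
    return [(int(round(r)), int(round(c))) for r, c in centers]
```

Each returned point would then be handed to the fixation-based segmentation algorithm, which separates the fixated object from its background.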

Object filters

Participants: Douglas Summerstay, Ching Teo, Yiannis Aloimonos


In recent work we developed a learning mechanism (Summerstay and Aloimonos, 2010) to train filters that find pixels depicting human beings. Here we use the same concept to detect pixels corresponding to object categories and employ it as an attention mechanism, which we call "object filters". In essence, nonlinear filters are trained, using a technique related in spirit to deep learning (Bengio, 2009), to obtain classifiers that compute the probability that a pixel belongs to a given object category. Training the filters requires labeled training images, whose acquisition is described below.

The learning proceeds as follows: for a given image pixel, we collect many multi-scale patches from the surrounding area and concatenate them into a feature vector. We reduce the feature vector's dimensionality using PCA and feed it to a classifier (a multilayer perceptron). After the classifier has been trained on labeled image pairs, its output, together with the original image, is used as input to a further classifier. The process is repeated a few times, each pass producing a refined probability map. Figure 1 illustrates the process.

Figure 1. Training the object filter: (Top) Thousands of multiscale training vectors are collected from each training image. (Bottom) In later training sessions, the training vectors are made up of samples from the previously estimated probability map as well as the original image.
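The cascaded training loop can be sketched as below. This is a simplified stand-in for the actual method: it uses a single patch scale rather than multi-scale patches, grayscale input, and arbitrary hypothetical hyperparameters (patch radius, number of PCA components, MLP size, number of rounds).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def extract_features(image, prob_map, r=2):
    """Per pixel, flatten a patch of the image and of the current
    probability map into one feature vector (single scale for brevity)."""
    pad_img = np.pad(image, r, mode='edge')
    pad_prob = np.pad(prob_map, r, mode='edge')
    feats = []
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            patch = pad_img[i:i + 2 * r + 1, j:j + 2 * r + 1].ravel()
            ppatch = pad_prob[i:i + 2 * r + 1, j:j + 2 * r + 1].ravel()
            feats.append(np.concatenate([patch, ppatch]))
    return np.asarray(feats)

def train_object_filter(image, labels, n_rounds=3):
    """Cascade of PCA + MLP stages; each stage sees the original image
    plus the probability map produced by the previous stage."""
    prob = np.zeros_like(image, dtype=float)  # flat prior in round 1
    stages = []
    for _ in range(n_rounds):
        X = extract_features(image, prob)
        pca = PCA(n_components=10).fit(X)
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
        clf.fit(pca.transform(X), labels.ravel())
        # Refined probability map feeds the next stage of the cascade.
        prob = clf.predict_proba(pca.transform(X))[:, 1].reshape(image.shape)
        stages.append((pca, clf))
    return stages, prob
```

At test time the same stages would be applied in sequence, again starting from a flat probability map.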


Training images were obtained as follows: we acquired multiple images of each object from different viewpoints, using scenes containing only two or three objects on the table as well as scenes containing all objects. We then ran the fixation-based segmentation algorithm, using both the RGB image and the depth map, to obtain the object masks automatically. Almost all objects could be extracted this way. As expected, the algorithm failed on transparent objects, and it returned two segmented regions for some flat, two-part objects, such as the knife, which consists of a handle and a blade. These objects were labeled by hand using a MATLAB annotation tool that we also developed.
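A crude sketch of how depth helps automatic mask extraction in a tabletop scene is shown below. This is not the fixation-based segmentation algorithm itself, just an illustration of the underlying cue: pixels that rise above the dominant (table) depth are likely object pixels. The function name and tolerance are hypothetical.

```python
import numpy as np

def object_mask(depth, tol=0.02):
    """Mark as 'object' every pixel nearer to the camera than the
    table surface, estimated as the most common depth value."""
    hist, edges = np.histogram(depth, bins=50)
    table_depth = edges[np.argmax(hist)]
    return depth < (table_depth - tol)
```

The real pipeline instead segments around a fixation point using both RGB and depth, which is what makes it robust enough to label almost all of the opaque objects automatically.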


The object filter could detect only a small class of objects. Figure 2 shows results on an image containing the 'big bowl', the 'salad spoon', and the 'small bowl'. In the figure, yellow denotes the 'big bowl' and turquoise the 'salad spoon'; both are detected well by the algorithm. The 'small bowl' is encoded as magenta, but as can be seen, none of the filters responded with high probability, so the area appears as a mix of colors. Nevertheless, although the algorithm is confused about the identity of that object, there is no doubt that an object is present; this is the power of the filter. We therefore couple the object filters with the torque mechanism to make sure no objects are missed.

Figure 2: Results of object filtering.


D. Summerstay and Y. Aloimonos, "Learning to recognize with anisotropic kernels," Proc. BICA (Biologically Inspired Cognitive Architectures) Workshop, Arlington, VA, November 2010.

Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, 2(1):1-127, 2009.