Spiking HMAX on SpiNNaker
Object recognition is one of the most challenging tasks in visual computation, yet our brain solves it with high precision while most computer algorithms fail or perform very badly. One reason for this lies in the way the two computations are performed: our brain works with a massively parallel network of fuzzy computational units, while a computer works with a very precise, serially operating computation unit.
One way of improving the performance of object recognition algorithms is to take biology as a source of inspiration and model its approach on a computer. The most promising model today seems to be the so-called HMAX, which models object recognition as a series of feature detection stages (edges, composite features, ...) and pooling stages (which compute the maximal activity within a set of features). This model not only reproduces the neural responses well but also performs acceptably on object recognition.
Another key difference between how our brain does object recognition and how the field of computer vision does it is the nature of the input. Computer vision works with frame-based pictures or movies, whereas our brain works with an asynchronous stream of events. This has one big advantage: instead of sampling a scene over and over again and starting from scratch on each frame, the eye directly encodes the presence of very simple features, which makes it possible to update only the relevant states. This approach massively reduces the redundancy of computation, and a series of sensors that work on such an event basis already exists (such as the DVS).
Combining the power of these two approaches would make it possible to do object recognition in a way very close to that of the brain. The key problem is that the required computations are highly parallel and asynchronous, which makes them very expensive to simulate on a classical computer. Therefore it makes sense to use another computational platform: SpiNNaker is a massively parallel computing platform ideal for the implementation of spiking neural networks.
So the goal of this project was to build a real-time object recognition system that uses the output of a silicon retina, by implementing an HMAX model on SpiNNaker. As a task for this system we chose to classify the first 8 pictures from an illustrated alphabet: apple, butterfly, corn, dog, egg, fish, goat,
The approach we took was centered around the available SpiNNaker hardware: a Spin2 test board that hosts four chips, i.e. 72 cores (18 cores per chip), 64 of which are usable for neural simulation. This hardware is capable of hosting several thousand neurons. However, for the purpose of this experiment we decided to use a set of Gabor filters provided by Bernabé's convolutional chips, which were already set up for the purpose.
We decided to use Bernabé's FPGA implementation of his spike-based convolution chip, tuned to receive spikes from a spatially subsampled retina (64x64). The retina output was injected into nine Gabor filters arranged in 3 rows (one for each filter size) and 3 columns (one for each orientation):
- 3 scales: 3x3, 5x5 and 7x7 input pixels
- 3 orientations: vertical, horizontal and one 45° diagonal Gabor filters
This led to a total of 192x192 S1 neurons (a 3x3 grid of 64x64 filter outputs).
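To make the layout of the filter bank concrete, here is a minimal sketch in Python of nine Gabor kernels (3 sizes x 3 orientations) and the resulting S1 grid size. The function name `gabor_kernel` and the values chosen for wavelength, sigma and aspect ratio are assumptions for illustration only; they are not the parameters configured on the convolution chips.

```python
import numpy as np

def gabor_kernel(size, theta_deg, gamma=0.5):
    """Return a size x size Gabor kernel (illustrative parameters only)."""
    wavelength = 0.8 * size   # assumption: not taken from the hardware setup
    sigma = 0.3 * size        # assumption: not taken from the hardware setup
    half = size // 2
    grid = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    y, x = grid
    theta = np.deg2rad(theta_deg)
    # Rotate the coordinates; for theta_deg = 0 the carrier varies along x.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * x_r / wavelength)
    # Remove the mean so that a uniform input gives zero response.
    return kernel - kernel.mean()

RETINA_SIZE = 64                # subsampled retina is 64x64
SIZES = (3, 5, 7)               # one row of the grid per filter size
ORIENTATIONS_DEG = (0, 45, 90)  # one column per orientation

bank = {(s, o): gabor_kernel(s, o) for s in SIZES for o in ORIENTATIONS_DEG}

# Each filter produces one 64x64 output map, tiled into a 3x3 grid.
s1_grid_shape = (len(SIZES) * RETINA_SIZE, len(ORIENTATIONS_DEG) * RETINA_SIZE)
```

With these settings `bank` holds nine kernels and `s1_grid_shape` is `(192, 192)`, matching the S1 layer size given above.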
The restricted number of neurons required us to find a computationally cheap way to compute the max function needed for the C1 layer. We considered the following options:
- Winner-take-all circuit: We could have tried to compute the max function by using recurrent connections to a shared inhibitory neuron (example). The main drawback of this approach is that it consumes a very large number of neurons, since each max function needs a neuron for each of its inputs plus an inhibitory neuron for the population. Therefore this approach was not applicable.
- An algorithmic solution: Instead of having a population compute the max function, we could have implemented a special computational unit to compute the max frequency. The simplest way to do so would be a "neuron" that has a counter for each input neuron, which is increased on the arrival of a spike. As soon as one of these counters reaches a threshold, the unit emits a spike and resets all counters. The firing frequency of such a "neuron" would therefore be proportional to the firing frequency of the fastest firing input neuron (divided by the threshold). We were close to implementing such a neuron on SpiNNaker, but then a new option arose: applying the Neural Engineering Framework (NEF) to the evaluation of the max function.
- Feed-forward connectivity: With the NEF it was possible to optimize the network connectivity to compute a max function in a feed-forward fashion, so this approach was the best in terms of neural connectivity.
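The counter-based unit from the algorithmic option can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the code we prepared for SpiNNaker; the class name `MaxCounterUnit`, the helper `simulate` and the regular input spike trains are hypothetical.

```python
class MaxCounterUnit:
    """A "neuron" with one spike counter per input (illustrative sketch)."""

    def __init__(self, n_inputs, threshold):
        self.threshold = threshold
        self.counters = [0] * n_inputs

    def receive(self, input_index):
        """Register one input spike; return True if the unit fires."""
        self.counters[input_index] += 1
        if self.counters[input_index] >= self.threshold:
            # Fire and reset all counters, as described in the text.
            self.counters = [0] * len(self.counters)
            return True
        return False


def simulate(periods, threshold, n_steps):
    """Drive the unit with regular spike trains; input i fires every periods[i] steps."""
    unit = MaxCounterUnit(len(periods), threshold)
    output_spikes = 0
    for t in range(n_steps):
        for i, period in enumerate(periods):
            if t % period == 0 and unit.receive(i):
                output_spikes += 1
    return output_spikes


# The fastest input fires every 2 steps (500 spikes in 1000 steps).
n_out = simulate(periods=[2, 5, 10], threshold=5, n_steps=1000)
```

In this toy run the output fires 100 times, i.e. at the rate of the fastest input divided by the threshold. The slower inputs never reach the threshold because all counters are reset whenever the fastest one does.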
The C1 layer, therefore, consisted of cells that were trained to compute the max function across all three scales within an input kernel of 8x8 neurons. We used half-overlapping input kernels (a stride of 4 neurons), so we ended up with 15x15 neurons for each of the 3 orientations (a total of 675 C1 neurons).
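As a reference for what the C1 cells are meant to approximate, the following sketch computes the same pooling arithmetically with NumPy. It assumes the S1 activity of one orientation is available as an array of firing rates with shape (3 scales, 64, 64); the function name `c1_max_pool` and the random test data are hypothetical, and the NEF-trained network itself is not reproduced here.

```python
import numpy as np

def c1_max_pool(s1, kernel=8, stride=4):
    """Max over all scales and over each kernel x kernel window.

    s1: array of shape (n_scales, height, width) for a single orientation.
    """
    _, height, width = s1.shape
    n_rows = (height - kernel) // stride + 1
    n_cols = (width - kernel) // stride + 1
    c1 = np.empty((n_rows, n_cols))
    for r in range(n_rows):
        for c in range(n_cols):
            window = s1[:, r * stride:r * stride + kernel,
                        c * stride:c * stride + kernel]
            c1[r, c] = window.max()
    return c1

# Synthetic S1 rates, for illustration only.
rng = np.random.default_rng(0)
s1_one_orientation = rng.random((3, 64, 64))
c1_one_orientation = c1_max_pool(s1_one_orientation)

n_orientations = 3
n_c1_total = n_orientations * c1_one_orientation.size
```

For a 64x64 input, a kernel of 8 and a stride of 4 give (64 - 8) / 4 + 1 = 15 positions per axis, so `c1_one_orientation` has shape `(15, 15)` and `n_c1_total` is 675.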
We decided to acquire our S2 cells by randomly sampling patches of the C1 activity while exposing the system to the training set. The patches were meant to span a spatial kernel of 3x3 C1 cells and all three orientations so we would end up with 9 input neurons for each S2 cell. After trying to measure the real-time response of the C1 cells, we realized that the cable between the convolution board and the SpiNNaker board was broken. Therefore we moved all the computation onto the computer. The code was then implemented on a spike basis to run in Python.
Since we had to move the whole computation to software, we decided to implement the C2 stage as arithmetic max functions.
As a classifier we took a support vector machine with a linear kernel. We used SciKit, a Python machine learning library, for that purpose. We also realized that the resolution of our model was far too low to do detailed object recognition, so we reduced our training set for the classifier to three stimuli: the butterfly, a vertical bar and an array of black squares.
The resolution of the whole model was far too low, so independent of the object presented to it, it produced a blob of activity. Nonetheless, we could make the system classify the three objects quite robustly.
The main lesson that we learned is that approaches such as HMAX need a lot of neurons to work properly, and a reduction of these numbers comes at a high price in performance. So it does not make sense to implement HMAX on devices simulating a small number of neurons, but we did not discover any fundamental weaknesses of implementing a spiking real-time HMAX model. Therefore we suggest that a next step would be to use the Spin4 with 48 chips to try a new implementation of the model with a greater number of neurons available for the simulation.
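A linear-kernel support vector machine of this kind can be set up with scikit-learn as in the sketch below. The feature vectors are synthetic stand-ins for C2 responses, generated only so that the example runs; the number of features, the noise level and the class prototypes are assumptions and do not reflect our recorded data.

```python
import numpy as np
from sklearn.svm import SVC

CLASS_NAMES = ["butterfly", "vertical bar", "array of black squares"]
N_FEATURES = 10     # hypothetical number of C2 features
N_PER_CLASS = 20    # hypothetical number of training samples per class

rng = np.random.default_rng(0)

# Synthetic prototypes: each class drives a different feature strongly.
prototypes = np.zeros((len(CLASS_NAMES), N_FEATURES))
for class_index in range(len(CLASS_NAMES)):
    prototypes[class_index, class_index] = 5.0

# Noisy samples around each prototype stand in for C2 feature vectors.
X = np.vstack([
    proto + 0.1 * rng.standard_normal((N_PER_CLASS, N_FEATURES))
    for proto in prototypes
])
y = np.repeat(np.arange(len(CLASS_NAMES)), N_PER_CLASS)

clf = SVC(kernel="linear")
clf.fit(X, y)

train_accuracy = clf.score(X, y)
predicted = [CLASS_NAMES[i] for i in clf.predict(prototypes)]
```

On well-separated toy data like this, the classifier recovers all three labels; the point of the sketch is only the shape of the pipeline (feature vectors in, one of three stimulus labels out), not the achievable accuracy on real C2 responses.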