Auditory recognition


Liquid state machine

Jeff Sprenger

Software implementation of a Liquid State Machine in Matlab to perform basic word recognition from speech processed by the silicon cochlea. A report of the project findings is included in the PDF document (LSM-Sprenger2010.pdf) at the end of this page (see attachments).

Feature detector on critical bands

Tara Julia Hamilton

Members: Tara Julia Hamilton with help from Jeff Sprenger, Michael Rapson and Andrew Schwartz.

Introduction: In this project we took the spiking output from the silicon cochlea and passed it through a spiking neural network in order to recognise words; specifically the number and suit of 52 playing cards. The network divides the 64 cochlea channels into 15 critical bands. These bands are then examined for particular features, for example, “s”, “a”, “ee” and so on. Feature detectors are then used to extract particular words using the features.

Spiking Neurons: In order to create a spiking neural network, spiking neurons need to be coded. In this network a simple leaky integrate-and-fire (LIF) neuron was used. To be compatible with future hardware versions of the network, a “digital” neuron was used. The code for this neuron is given below:

function [Sout,Vmem,I] = LIF(Sin, Tau, Vth, R, Vreset, Ilast, Vmem_last)
    I = Sin; % This is the “synapse”. Here the input spike (Sin) = current.
    % Discrete-time membrane update (uses the previous current and voltage)
    Vmem = (1/(1+Tau))*(I*R + Ilast*R + Vmem_last*(Tau - 1));
    if (Vmem > Vth) % Reset condition
        Sout = 1; % Output spike (Sout)
        Vmem = Vreset;
    else
        Sout = 0; % No output spike
    end

The digital neuron is derived as follows. The continuous-time, analogue LIF neuron is defined in the Laplace domain as:

$V_{mem} = \frac{R}{\tau s + 1} I,$

Substituting the Laplace variable $s$ with its z-domain equivalent, using the following relationship (the bilinear transform with its constant scale factor taken as 1):

$s = \frac{1 - z^{-1}}{1 + z^{-1}},$

we obtain:

$V_{mem} = \frac{1}{\tau + 1} \left[ I R \left(1 + z^{-1}\right) + z^{-1} V_{mem} (\tau - 1) \right].$

The term $z^{-1}$ is the delay operator and therefore provides a memory of the previous values of $I$ and $V_{mem}$ (Ilast and Vmem_last in the code). This simple neuron model contains no adaptation in the synapse or the threshold level. Such extensions to the model, however, may be required in the future.
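The intermediate algebra between the continuous-time definition and the discrete-time result can be filled in as follows. This is a sketch of the substitution step only, using the same symbols as above and taking the scale factor of the transform as 1:

```latex
\begin{align*}
V_{mem}\left(\tau\,\frac{1 - z^{-1}}{1 + z^{-1}} + 1\right) &= R\,I \\
V_{mem}\left[\tau\left(1 - z^{-1}\right) + \left(1 + z^{-1}\right)\right] &= I R \left(1 + z^{-1}\right) \\
V_{mem}(\tau + 1) - z^{-1} V_{mem} (\tau - 1) &= I R \left(1 + z^{-1}\right) \\
V_{mem} &= \frac{1}{\tau + 1}\left[ I R \left(1 + z^{-1}\right) + z^{-1} V_{mem} (\tau - 1) \right]
\end{align*}
```

Reading $z^{-1}$ as a one-step delay gives the update $V_{mem}[n] = \frac{1}{\tau + 1}\left( I[n] R + I[n-1] R + V_{mem}[n-1] (\tau - 1) \right)$, which is the expression evaluated in the LIF function.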

Figure 1 shows the output of the neuron when the input is held constant for a period of time and then turned off.

Figure 1. Top: input to the neuron, Sin. Middle: Spikes out, Sout. Bottom: Membrane voltage, Vmem. The threshold voltage, Vth, is given in red. The x-axis is time step and the y-axis is the magnitude of the signal in constant units.
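The experiment in Figure 1 can be approximated with the short Python sketch below (the project itself used Matlab; Python is used here only for illustration). `lif_step` is a direct port of the LIF function; the `simulate` helper and all parameter values (`tau=10`, `v_th=1`, `r=1`, an input of 2 for 100 steps) are assumptions chosen for the example, not the values used in the project.

```python
def lif_step(s_in, tau, v_th, r, v_reset, i_last, v_last):
    """One time step of the digital LIF neuron (port of the Matlab LIF function)."""
    i = s_in  # the "synapse": the input is used directly as the current
    v = (i * r + i_last * r + v_last * (tau - 1)) / (1 + tau)
    if v > v_th:  # reset condition
        return 1, v_reset, i
    return 0, v, i


def simulate(inputs, tau=10.0, v_th=1.0, r=1.0, v_reset=0.0):
    """Run the neuron over a list of inputs; returns (spikes, membrane voltages)."""
    s_out, v_trace = [], []
    i_last, v_last = 0.0, v_reset
    for s_in in inputs:
        spike, v_last, i_last = lif_step(s_in, tau, v_th, r, v_reset, i_last, v_last)
        s_out.append(spike)
        v_trace.append(v_last)
    return s_out, v_trace


# Constant input for 100 steps, then switched off for 100 steps.
inputs = [2.0] * 100 + [0.0] * 100
spikes, vmem = simulate(inputs)
```

With a constant input the membrane voltage settles towards I*R, so the neuron fires repeatedly only if I*R exceeds the threshold; once the input is removed the voltage decays by a factor (tau - 1)/(tau + 1) per step and firing stops.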

Figure 2. The output from the binaural silicon cochlea for the utterance "ace of hearts".

Spiking output from the cochlea: The silicon cochlea designed by Shih-Chii Liu provides outputs using address-event representation (AER). In this output representation both the magnitude of the analogue output of the cochlea and the timing information of the sound signal are encoded using 4 silicon neurons per channel. The silicon cochlea is binaural, meaning that there are two cochleae. In this project we used only one of them (the left cochlea) and one of the neurons for each channel. A future improvement to this design, which should increase accuracy, would be to utilize the information from both cochleae and from all four neurons per channel (using averaging, for example).

The output from the silicon cochlea for the utterance “Ace of hearts” is given in Figure 2. Here we see the output of both the left (red) and the right (green) cochleae, with all 4 neurons per channel for each. The first neuron for each channel is coloured grey.

The output for the same utterance using only the left cochlea and the first neuron for each channel is given in Figure 3. Here it can be seen that very little information is lost. In both Figure 2 and Figure 3 the x-axis represents time while the y-axis represents the address of the neurons.

Figure 3. The output of the first neuron in each channel of the left silicon cochlea for the utterance "ace of hearts".

Critical Bands: Looking at the cochlea “spikogrammes” in Figure 2 and Figure 3, it is clear that different information in the speech is represented in different channels. Since the cochlea has rather wide filters, the same or similar information can often be found in adjacent channels. Thus, critical bands were defined. These critical bands were set up such that 8 adjacent channels were input to a single neuron whose output served as the critical band output. Adjacent critical bands overlap by 4 channels. For example, critical band 1 takes channels 1 to 8 as input while critical band 2 takes channels 5 to 12. Thus, 15 critical bands were required to cover the 64 channels. The structure of the critical bands is shown in Figure 4.

Figure 4. Structure of the critical bands.
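The band layout described above (8 channels per band, 4 channels of overlap, 64 channels in total) can be checked with a small Python sketch. The function name `critical_bands` and its parameters are hypothetical; only the numbers come from the text.

```python
def critical_bands(n_channels=64, width=8, overlap=4):
    """Return (first_channel, last_channel) for each critical band, 1-indexed."""
    step = width - overlap  # each band starts 4 channels after the previous one
    bands = []
    start = 1
    while start + width - 1 <= n_channels:
        bands.append((start, start + width - 1))
        start += step
    return bands


bands = critical_bands()
for number, (first, last) in enumerate(bands, start=1):
    print(f"critical band {number}: channels {first} to {last}")
```

The band count follows from (64 - 8) / 4 + 1 = 15, with the last band covering channels 57 to 64.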

The output from critical bands 1 to 4 is shown in Figure 5. Here we see that the “s” sound in “ace” and “hearts” is found in bands 1 and 2, while the “s” in “hearts” can also be seen in bands 3 and 4. This information can be used to detect particular features in particular words.

Figure 5. Spiking output from critical bands 1 to 4 for the utterance "ace of hearts".

Feature Detection: In order to detect particular features in the speech, the critical bands were examined for differences in their outputs. For example, Figure 5 shows that the “s” sound in “ace” can only be identified in critical bands 1 and 2. Thus, in order to detect the word “ace”, an “a” sound followed by an “s” sound is required.

By inspection of the other 11 critical bands, the “a” sound in “ace” was found to be most emphasised in critical band 9. To find the sequence “a” followed by “s”, the “a” sound, once detected, triggers a tonic spiking neuron, that is, a neuron which continues to fire once a particular input has been detected. When the “s” sound is then detected in critical bands 1 and 2 while the tonic neuron is active, an “ace” has been detected. The neurons used for the detection of the word “ace” are illustrated in Figure 6.

Figure 6. Feature detector for the word "ace".

In Figure 6, the large circles represent neurons while the small circles represent synaptic connections; white: excitatory, black: inhibitory. The output of the neurons depicted in Figure 6 is shown in Figure 7.
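The sequencing logic of the “ace” detector (“a” in critical band 9 followed by “s” in critical bands 1 and 2) can be sketched in Python as below. This is a deliberately simplified illustration, not the network of Figure 6: the LIF neurons are replaced by Boolean logic, the tonic neuron is modelled as a latch, and resetting the latch after a detection is an assumption. The function name `detect_ace` is hypothetical.

```python
def detect_ace(band9, band1, band2):
    """Return (tonic_trace, detect_trace) for per-time-step spike lists (0 or 1)."""
    tonic = 0  # state of the tonic "a has been heard" neuron
    tonic_trace, detect_trace = [], []
    for a, s1, s2 in zip(band9, band1, band2):
        s_sound = s1 and s2                 # "s" needs activity in both bands 1 and 2
        detected = int(bool(tonic and s_sound))
        if detected:
            tonic = 0                       # assumption: detection switches the latch off
        elif a:
            tonic = 1                       # "a" in band 9 switches the latch on
        tonic_trace.append(tonic)
        detect_trace.append(detected)
    return tonic_trace, detect_trace


# "s"-like activity at step 0 is ignored because no "a" has been heard yet.
band9 = [0, 1, 0, 0, 0, 0]
band1 = [1, 0, 0, 1, 1, 0]
band2 = [1, 0, 0, 0, 1, 0]
tonic_trace, detect_trace = detect_ace(band9, band1, band2)
```

In the example the detector fires only at step 4, the first time both “s” bands are active after the “a” band has fired.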

Using similar feature detectors, words such as “king”, “six”, “seven”, “spades” and so on can be detected. Unfortunately there are a number of words (“two”, “three”, “four”, etc.) which are very hard to detect simply by dividing the sound into critical bands. Figure 8 shows the similarities between the “spikogrammes” of the utterances “two of hearts” (left) and “three of hearts” (right). Clearly, other methods are required in order to separate words whose features are this similar.

Figure 7. Output from the neurons used in the detection of the word "ace". Top: output from critical band 9. Middle top: output from the tonic spiking neuron. Middle bottom: combined output of channels 1 and 2. Bottom: “ace” detection neuron.

Figure 8. Showing the similarities between different utterances. Left: the utterance "two of hearts". Right: the utterance "three of hearts".

Future Work: Spike timing, more elaborate synaptic weightings and more elaborate neuron models should be investigated in order to improve the recognition abilities of this network. One of the road-blocks to doing this successfully is the lack of a “spiking” neural network library that allows neurons to be tuned simply and different types of neurons to be easily instantiated. Currently, it is extremely time-consuming to individually tune neurons in a spiking network. Easy ways to incorporate learning or particular neuronal features such as bursting or tonic spiking are essential if complex, spiking neural networks are to become a useful tool for doing computation.

Coincidence on spikes

Nima Mesgarani, Shih-Chii Liu

We used a coincidence measure of spikes between different channels as a way of coding for cochlea features. In Figure 1, we show the spikes from the channels in response to the Ace of Clubs (top) and the histogram of these spikes in 5 ms bins (bottom). We then applied two masks to these histograms for the 52 cards, segmenting each card phrase into the time span of the value and the time span of the suit. A cross-correlation between the channels was then computed for the 4 suits of the 52 cards (see Figure 2 for the cross-correlation between the channels for Hearts). The complete cross-correlation matrices for the 4 suits and the 13 values are shown in Figure 3.

Figure 1. Top. Spike responses for "Ace of Clubs". Bottom. Binned spike rates for a bin size of 5 ms.

The cross-correlation matrices show that certain suits are easily distinguishable (e.g. spades), and the same holds for the card values.

Figure 2. Cross-correlation matrix between different channels for the word "Hearts".

Figure 3. Left. Cross-correlation matrix between the suits. Right. Cross-correlation matrix between the values.
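The binning and channel-by-channel correlation steps can be sketched in Python as follows. The text does not state which correlation measure was used, so the zero-lag normalised (Pearson) correlation here is an assumption, as are the function names and the toy spike times. The `t_start_ms` and `t_end_ms` arguments stand in for the value or suit segment selected by a mask.

```python
import math


def bin_spikes(spike_times_ms, t_start_ms, t_end_ms, bin_ms=5):
    """Count spikes in consecutive bins of width bin_ms within [t_start_ms, t_end_ms)."""
    n_bins = int(math.ceil((t_end_ms - t_start_ms) / bin_ms))
    counts = [0] * n_bins
    for t in spike_times_ms:
        if t_start_ms <= t < t_end_ms:
            counts[int((t - t_start_ms) // bin_ms)] += 1
    return counts


def correlation(x, y):
    """Zero-lag normalised correlation of two equal-length histograms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat histogram carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(histograms):
    """Correlation of every channel histogram with every other one."""
    return [[correlation(x, y) for y in histograms] for x in histograms]


# Three toy channels: the first two fire together, the third fires later.
channels = [[1, 2, 6, 7, 12], [1, 2, 6, 7, 12], [16, 17, 18]]
hists = [bin_spikes(ch, 0, 20) for ch in channels]
matrix = correlation_matrix(hists)
```

Channels that fire in the same bins give entries close to 1, while channels that fire at different times give low or negative entries; the pattern of entries is what serves as the feature.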

One-shot learning

Hynek Hermansky, Nima Mesgarani

We used a test set and a training set of utterances of the 52 cards as spoken by one person (scSS1 and scSS2 data files). We computed average firing rates over the whole end-pointed utterance in 4 frequency bands and approximated the rate vector by a truncated Fourier series. The Euclidean distance between the Fourier coefficients of the series indicates the similarity between the training utterance and the test utterance.


Figure 1. Firing rates for 2 of the 4 bands. The right column shows rates for the test set and the left column shows rates for the training set.

Figure 2. Distance matrix between the test and training sets.

The best recognition accuracy on this single speaker over the 52 card names is 40.4%.
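A minimal Python sketch of the matching step is given below: each band's firing-rate trajectory is reduced to its first few Fourier coefficients, and a test utterance is assigned to the training utterance at the smallest Euclidean distance. The number of coefficients (5), the normalisation by utterance length and all function names are assumptions; the synthetic rate curves stand in for real data.

```python
import cmath
import math


def fourier_coeffs(rates, n_coeffs=5):
    """First n_coeffs DFT coefficients of a rate trajectory, normalised by its length."""
    n = len(rates)
    out = []
    for k in range(n_coeffs):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(rates)) / n
        out.extend([c.real, c.imag])
    return out


def features(band_rates, n_coeffs=5):
    """Concatenate the truncated Fourier coefficients of all frequency bands."""
    feats = []
    for rates in band_rates:
        feats.extend(fourier_coeffs(rates, n_coeffs))
    return feats


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(test_feats, templates):
    """Label of the single training template closest to the test features."""
    return min(templates, key=lambda label: euclidean(test_feats, templates[label]))


def ramp(n):
    return [t / n for t in range(n)]


def bump(n):
    return [math.sin(math.pi * t / n) for t in range(n)]


# One training example per word (4 bands each); the test utterance is longer.
templates = {"up": features([ramp(40)] * 4), "bump": features([bump(40)] * 4)}
label = classify(features([ramp(55)] * 4), templates)
```

Because the coefficients are normalised by the utterance length, utterances of different durations map to feature vectors of the same size, which is what makes a direct Euclidean comparison possible.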

Point process models

Aren Jansen

An investigation was conducted into the use of point process models of the output of the silicon cochlea to produce word transcriptions of the playing card recordings. A detection-based architecture was adopted (Jansen and Niyogi, 2010), where each card rank (e.g. ace, two, jack) and face (e.g. spades, diamonds) was independently modeled and detected in the continuous speech recordings. While the individual word detectors are prone to false alarms, standard speech recognition techniques are capable of reducing them to a single prediction of the word sequence with maximum likelihood. In particular, the collection of rank and face detections was combined into a lattice of possible utterances and Viterbi decoded to produce a single prediction of the word sequence.

The cochlear spike trains observed for each rank and face word were modeled as inhomogeneous Poisson processes with time-dependent rate parameters $\lambda_{iw}(t)$ for cochlear channel $i$ and word $w$ (Jansen and Niyogi, 2009). These rate parameters are parametrically modeled as piecewise constant functions of time with 20 equal divisions of the word duration. Examples of these rate parameter functions for the words 'two' and 'spades' are shown in the figure. Dark regions correspond to channels and times within each word with a heightened rate of cochlear activity. The background rate of each channel is modeled as a homogeneous Poisson process with one constant rate parameter for each cochlear channel, representing the mean firing rate across all the speech.

For each word $w$, the two models are combined into a word detector function defined by the log of the ratio between the word likelihood and the background likelihood at each point in time. When this log likelihood ratio peaks above a preset threshold, a candidate detection for the word is added to the lattice to be considered as a viable path in the Viterbi decode.
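The detector function can be illustrated with the Python sketch below, which evaluates the log likelihood ratio for one candidate window using the standard Poisson process log likelihood (sum of log rates at the spike times minus the integral of the rate). Function names, the toy rates and the assumption that all rates are strictly positive are illustrative only; peak picking, thresholding and the Viterbi decode are not shown.

```python
import math

N_DIV = 20  # equal divisions of the word duration, as in the text


def word_log_likelihood(spikes, rates, t0, duration):
    """spikes: per-channel spike times (s); rates: per-channel list of N_DIV rates (Hz)."""
    div = duration / N_DIV
    ll = 0.0
    for times, lam in zip(spikes, rates):
        ll -= sum(lam) * div  # integral of the piecewise-constant rate
        for t in times:
            if t0 <= t < t0 + duration:
                b = min(int((t - t0) / div), N_DIV - 1)
                ll += math.log(lam[b])  # assumes every rate is > 0
    return ll


def background_log_likelihood(spikes, bg_rates, t0, duration):
    """Homogeneous Poisson model with one constant rate per channel."""
    ll = 0.0
    for times, mu in zip(spikes, bg_rates):
        n = sum(1 for t in times if t0 <= t < t0 + duration)
        ll += n * math.log(mu) - mu * duration
    return ll


def detector(spikes, rates, bg_rates, t0, duration):
    """Log likelihood ratio of the word model against the background model."""
    return (word_log_likelihood(spikes, rates, t0, duration)
            - background_log_likelihood(spikes, bg_rates, t0, duration))


# One channel: the toy word model expects activity in the first half of the word.
word_rates = [[50.0] * 10 + [1.0] * 10]
bg_rates = [10.0]
early = [[0.015, 0.035, 0.055, 0.075, 0.095]]
late = [[0.115, 0.135, 0.155, 0.175, 0.195]]
score_early = detector(early, word_rates, bg_rates, 0.0, 0.2)
score_late = detector(late, word_rates, bg_rates, 0.0, 0.2)
```

Spikes that fall where the word model expects high activity give a positive score, while the same number of spikes in the wrong half of the window gives a negative one.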

While there were many kinks in the implementation that there was not adequate time to address, the prototype was still capable of recognizing approximately 50% of the individual words correctly and recognized ~20% of the cards completely correctly (i.e., both the rank and face were correct). Note that this model-based system is capable of detecting novel utterances; however, since it was trained with a single speaker, speaker-independent competence should not be assumed. Future refinement of the system design, in conjunction with larger collections of training data, promises substantial improvements to these preliminary results.


A. Jansen and P. Niyogi. Detection-Based Speech Recognition with Sparse Point Process Models. In Proceedings of ICASSP 2010.

A. Jansen and P. Niyogi. Point Process Models for Spotting Keywords in Continuous Speech. IEEE Transactions on Audio, Speech, and Language Processing, 2009.

ISI feature detector

Shih-Chii Liu, John Harris

We investigated whether there was information in the inter-spike interval (ISI) histograms of the cochlea spikes for different vowels. The output of the vowel detection process can then be used for segmenting the words. ISI histograms of each channel were computed in a running time window of 20 ms, and the histograms of different channels were then combined. The plots of the histograms for the different vowels show that there are differences between the vowels.

Figure 1. ISI histogram of spike channels for the 5 vowels spoken by a single person.

Figure 2. Frequency histogram derived from the ISI histogram.
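A minimal Python sketch of the ISI histogram computation for one 20 ms window is shown below. The 1 ms histogram resolution, the summation used to combine channels and the function names are assumptions, and the spike times are synthetic.

```python
def isi_histogram(spike_times_ms, window_start_ms, window_ms=20, bin_ms=1):
    """Histogram of inter-spike intervals for spikes inside one time window."""
    n_bins = int(window_ms // bin_ms)
    hist = [0] * n_bins
    in_window = sorted(t for t in spike_times_ms
                       if window_start_ms <= t < window_start_ms + window_ms)
    for prev, cur in zip(in_window, in_window[1:]):
        b = int((cur - prev) // bin_ms)
        if b < n_bins:
            hist[b] += 1
    return hist


def combined_isi_histogram(channels, window_start_ms, window_ms=20, bin_ms=1):
    """Sum the per-channel ISI histograms (assumed way of combining channels)."""
    total = [0] * int(window_ms // bin_ms)
    for spike_times in channels:
        hist = isi_histogram(spike_times, window_start_ms, window_ms, bin_ms)
        total = [a + b for a, b in zip(total, hist)]
    return total


def isi_to_frequency_hz(isi_ms):
    """An ISI of isi_ms milliseconds corresponds to a rate of 1000 / isi_ms Hz."""
    return 1000.0 / isi_ms


# Two toy channels phase-locked to 200 Hz (5 ms ISI) and 250 Hz (4 ms ISI).
channels = [[0, 5, 10, 15], [2, 6, 10, 14, 18]]
combined = combined_isi_histogram(channels, window_start_ms=0)
```

Sliding `window_start_ms` along the recording gives the running histogram, and mapping each ISI bin through `isi_to_frequency_hz` gives a frequency histogram of the kind shown in Figure 2.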

Dynamic time warping

John Harris

We used dynamic time warping to compare cepstral features extracted from spike rate histograms binned over 50 ms. When the same set of 52 cards is used for both training and testing, the recognition performance is 100% (no errors). However, when testing these features against a separate test set, the performance was not as good. We will explore the reasons for this discrepancy in the future.

Figure 1. Top left. Spike responses for "Ace of Clubs". Bottom left. Binned spike rates for a bin size of 50 ms. Top right and bottom right. Spike responses and binned spike rates for same card phrase but different training set.

Figure 2. Top left. Extracted cepstral features for "Ace of Clubs" in training set. Bottom left. Dynamic time warping path in comparing 2 cards. Top right. Extracted cepstral features for "Ace of Clubs" in test set. Bottom right. Confusion matrix for all 52 cards.
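The comparison of two card utterances by dynamic time warping can be sketched in Python as follows. This is the textbook algorithm with a Euclidean frame distance and unconstrained steps, which may differ from the variant used in the project; the feature sequences are toy one-dimensional vectors standing in for cepstral features, and the function names are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(a, b):
    """Accumulated cost of the best warping path between feature sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(test_seq, templates):
    """Label of the training sequence with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(test_seq, templates[label]))


templates = {
    "rising": [[0.0], [1.0], [2.0], [3.0]],
    "falling": [[3.0], [2.0], [1.0], [0.0]],
}
# Same shape as "rising" but spoken more slowly (frames repeated).
test_seq = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
label = classify(test_seq, templates)
```

Because the warping path may repeat frames, a slower rendition of the same sequence matches its template at zero cost, which is why testing on the training set itself trivially gives no errors.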