Problem Description and Data Collection

Action recognition from sensory input has long been a hot topic in both the biological and computer vision fields. The traditional practice in computer vision treats the problem as a classification task: the input is a previously segmented video clip, and the output space is a fixed set of candidate action labels. Framing the problem this way is unnatural, since it does not require the system to treat the sensory input as a continuous stimulus and constantly update its prediction.

A human being, by contrast, constantly updates a belief about which action another person intends to perform while observing that person act. This capability needs to be pro-active, in the sense that we can react during, or even before, the other party's action. For example, from the hand pose of a person approaching a knife, we can easily tell whether he or she is malicious or merely going to hand the knife over. This helps prevent potential danger if the other party is malicious.

The recent success of pre-trained convolutional neural networks (CNNs) and recurrent neural networks (RNNs) sheds light on training such a pro-active system from a limited amount of weakly annotated data. In this project, we focus on human manipulation actions: each of six subjects is asked to use one of six different tools to perform five different actions. The goal is to train a biologically inspired system that updates its belief distribution over the five candidate actions from the moment the subject starts to move his or her hand(s). Furthermore, we quantitatively measure how well the trained system predicts correctly before the subject's hand even touches the targeted tool. Results from the computational study are compared with a psychophysics study in which human subjects perform the same prediction task. In the final demo, we deployed a real-time integrated system that observes a human subject performing actions and outputs predictions from the very beginning; whenever the system reaches a strong belief that a certain action is about to be performed, it verbalizes that prediction. These seed results from the workshop leave further study along this line of research well poised.

The Method

1. Preprocessing

We use a mean-shift-based hand tracker to obtain the position of the hand. We then crop image patches centered at the hand, so our input is a sequence of hand patches.
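The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the tracker used in the project: the weight map standing in for a skin-color back-projection, the window and patch sizes, and the function names are all assumptions.

```python
import numpy as np

def mean_shift(weight_map, center, window=32, n_iter=10):
    """One-object mean-shift step: repeatedly move a square window to the
    centroid of the weights inside it (e.g. a skin-colour back-projection)
    until it converges on the hand. `center` is (row, col)."""
    h, w = weight_map.shape
    cy, cx = center
    half = window // 2
    for _ in range(n_iter):
        y0, y1 = max(0, cy - half), min(h, cy + half)
        x0, x1 = max(0, cx - half), min(w, cx + half)
        win = weight_map[y0:y1, x0:x1]
        total = win.sum()
        if total == 0:          # no evidence under the window; stop
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = int(round((ys * win).sum() / total))
        nx = int(round((xs * win).sum() / total))
        if (ny, nx) == (cy, cx):  # converged
            break
        cy, cx = ny, nx
    return cy, cx

def crop_patch(frame, center, size=128):
    """Crop a size x size patch centered on the tracked hand, with the
    center clamped so the patch stays inside the frame (assumes the frame
    is at least size x size)."""
    h, w = frame.shape[:2]
    half = size // 2
    cy = min(max(center[0], half), h - half)
    cx = min(max(center[1], half), w - half)
    return frame[cy - half:cy + half, cx - half:cx + half]
```

Running `mean_shift` once per frame, seeded with the previous frame's center, and feeding each `crop_patch` result downstream yields the sequence of hand patches described above.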

2. Feature Extraction

We apply a pre-trained convolutional neural network (CNN) to extract features. Specifically, we use the 16-layer VGG net to obtain the image features; each hand patch is projected into a 4096-dimensional feature vector.

3. Model Training

Based on the CNN features, we train a long short-term memory (LSTM) model for the action prediction task. The LSTM is a recurrent neural network architecture that is more stable to train and better suited to long time-series prediction tasks. (Please refer to the LSTM tutorial for more information: http://deeplearning.net/tutorial/lstm.html)
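The predictor can be sketched as a single LSTM layer over the 4096-d CNN features with a linear readout to the five action classes, producing a belief distribution at every frame. The hidden size and the single-layer choice are illustrative assumptions; the report does not specify them.

```python
import torch
from torch import nn

class ActionLSTM(nn.Module):
    """Per-frame action predictor sketch: LSTM over CNN features, then a
    linear layer and softmax giving a belief over the candidate actions
    at every frame. Layer sizes are assumptions, not the project's exact
    configuration."""
    def __init__(self, feat_dim=4096, hidden=256, n_actions=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, feats):
        # feats: (batch, n_frames, feat_dim)
        h, _ = self.lstm(feats)           # (batch, n_frames, hidden)
        logits = self.head(h)             # (batch, n_frames, n_actions)
        return logits.softmax(dim=-1)     # per-frame belief distribution
```

Because the readout is applied at every time step, the model emits an updated belief as each new frame arrives, which is what enables the frame-by-frame prediction described below.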

The chart below shows the structure of our RNN model. Following standard practice, we train the LSTM model with stochastic gradient descent. We perform backpropagation at each frame with an additional gain W, linearly interpolated over [0, 1] across the frames, so that later events in the sequence have more impact. During training, we use the same class label for all frames in a sequence. During testing, we generate a confidence score over the candidate actions for each frame, so that we can predict the action frame by frame.
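The weighted training objective can be sketched as follows, assuming the raw (pre-softmax) logits of the model above. Frame t of a T-frame sequence receives gain t/(T-1), rising linearly from 0 to 1; normalizing by the sum of the gains is an assumption, as the report does not state how the weighted terms are combined.

```python
import torch
from torch import nn

def weighted_sequence_loss(logits, label, n_frames):
    """Per-frame cross-entropy with the linear gain W from the text:
    frame t gets weight t/(T-1), so later frames dominate the gradient.
    `logits`: (n_frames, n_classes) raw scores for one sequence;
    `label`: the single action label shared by all its frames.
    Assumes n_frames >= 2."""
    w = torch.linspace(0.0, 1.0, n_frames)               # the gain W
    targets = torch.full((n_frames,), label, dtype=torch.long)
    per_frame = nn.functional.cross_entropy(logits, targets,
                                            reduction="none")
    return (w * per_frame).sum() / w.sum()
```

Minimizing this loss with SGD, as described above, leaves the first frame (weight 0) with no gradient contribution while the final frame contributes at full strength.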