Problem Description and Data Collection

We were interested in how humans perform on the task of visual action prediction. Subjects were shown videos of the same hand patches that were used in our computational algorithm and asked to judge which action they perceived.

Method

Subjects were shown three of the objects (cup, sponge, and spoon), with five actions performed on each object. The videos were of four lengths: -10, 0, 10, and 25 frames from the contact point (the time when the actor first touches the object). A MATLAB GUI was created for the experiment. Subjects were first shown demonstrations; they were then shown videos and asked to judge which action they perceived by selecting one of five push buttons on a menu.

In the first experiment, subjects were shown 40 videos for each of the three objects and were not given feedback. Twenty subjects participated. In the second experiment, subjects were shown four blocks of 40 videos for each object and were given feedback on which action was correct. This experiment was designed to evaluate learning performance. Two subjects participated. The figure below shows the MATLAB GUI.


The following figure provides the statistics for the first experiment. The subplot on the top left shows the average success rate (over all objects and subjects) at the different times. At 10 frames before contact, classification accuracy is at chance (0.2 with five possible actions). At the contact point it is 0.3 (out of 1), and it increases to nearly 0.9 at 25 frames after contact. The subplot on the top right shows the overall classification accuracy for the different objects (best for the sponge, second best for the cup, and third best for the spoon). The success rate for the individual objects is detailed in the figure at the bottom.
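The averages described above can be computed from a per-trial response log. The following is a minimal sketch in Python, not the analysis code used for the figures: the record layout (`object`, `offset`, `action`, `response`) and the toy records are assumptions made purely for illustration.

```python
from collections import defaultdict

N_ACTIONS = 5                     # five push buttons, one per action
CHANCE_LEVEL = 1.0 / N_ACTIONS    # success rate expected from guessing


def success_rate_by(trials, key):
    """Return {group value: fraction of correct responses} for one grouping key."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for trial in trials:
        group = trial[key]
        counts[group] += 1
        # A trial counts as a success when the selected action matches the shown one.
        hits[group] += int(trial["response"] == trial["action"])
    return {group: hits[group] / counts[group] for group in counts}


# Hypothetical toy records (NOT the experimental data); "offset" is the
# video length in frames relative to the contact point.
toy_trials = [
    {"object": "cup", "offset": -10, "action": 1, "response": 3},
    {"object": "cup", "offset": 25, "action": 1, "response": 1},
    {"object": "sponge", "offset": 25, "action": 2, "response": 2},
    {"object": "spoon", "offset": 0, "action": 4, "response": 4},
    {"object": "spoon", "offset": 0, "action": 5, "response": 2},
]

by_offset = success_rate_by(toy_trials, "offset")   # as in the top-left subplot
by_object = success_rate_by(toy_trials, "object")   # as in the top-right subplot
```

Grouping by `offset` corresponds to the success rate over time, and grouping by `object` to the per-object comparison; `CHANCE_LEVEL` gives the baseline against which the rate at -10 frames is judged.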

The next figure shows the same statistics for the second experiment, in which subjects were given feedback. The four bars in each plot show the average classification accuracy after the first, second, third, and fourth block of videos. As can be seen, subjects slightly improved their classification accuracy.
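The per-block accuracy behind such bars can be obtained by splitting a subject's responses for one object into consecutive blocks of 40 videos. The sketch below assumes the responses are available as an ordered list of correct/incorrect flags; the function name and the toy flags are hypothetical and do not reproduce the experimental data.

```python
def block_accuracy(correct_flags, block_size=40):
    """Return the fraction of correct responses in each consecutive block."""
    if len(correct_flags) % block_size != 0:
        raise ValueError("number of responses must be a multiple of block_size")
    return [
        sum(correct_flags[start:start + block_size]) / block_size
        for start in range(0, len(correct_flags), block_size)
    ]


# Hypothetical responses for one subject and one object: 4 blocks of 40 videos.
toy_flags = (
    [True] * 20 + [False] * 20      # block 1: 20 of 40 correct
    + [True] * 24 + [False] * 16    # block 2: 24 of 40 correct
    + [True] * 28 + [False] * 12    # block 3: 28 of 40 correct
    + [True] * 30 + [False] * 10    # block 4: 30 of 40 correct
)

per_block = block_accuracy(toy_flags)
```

An increase from the first to the last entry of `per_block` would indicate learning over the course of the feedback blocks.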

The last figure compares the classification results of our RNN algorithm (using image patches as input) with the classification by the subjects in the first experiment. As can be seen, the algorithm's predictions are worse overall, but for two of the objects it reaches comparable performance. For the cup and the sponge after 25 frames, the algorithm's accuracy is between 70 and 80 percent, while for human subjects it is around 80 to 90 percent.