On Action Recognition

Yi Li, NICTA, Australia


Simple tree models for articulated objects have prevailed over the last decade. However, it is also believed that these simple tree models cannot capture the large variations that arise in many scenarios, such as human pose estimation. This paper attempts to address three questions: 1) are simple tree models sufficient? More specifically, 2) how can tree models be used effectively in human pose estimation? And 3) how shall we use combined parts together with single parts efficiently? Assume we have a set of single parts and combined parts, and the goal is to estimate a joint distribution of their locations. We surprisingly find that, when learning latent trees for the deformable model on the Leeds Sport Dataset (LSP), which aims at approximating the joint distribution of body part locations with a minimal tree structure, no latent variables are introduced. This suggests one can straightforwardly use a mixed representation of single and combined parts to approximate their joint distribution in a simple tree model. As such, one only needs to build Visual Categories of the combined parts, and then perform inference on the learned latent tree. Our method outperformed the state of the art on the LSP, both when the training images are from the same dataset and when they are from the PARSE dataset. Experiments on animal images from the VOC challenge further support our findings. This work is primarily sponsored by Bionic Eye, a special initiative of the Australian Government through the Australian Research Council. A preliminary version of the paper is available at http://arxiv.org/pdf/1304.6291v1.pdf
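Since the abstract reports that learning introduces no latent variables, the minimal tree approximation of the joint part-location distribution reduces to its classical non-latent special case: a Chow-Liu tree, i.e. a maximum spanning tree over pairwise mutual information between part locations. The sketch below illustrates that idea only; it is not the paper's implementation, and the discretization, bin count, and synthetic "part location" variables are all illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information (in nats) between
    two discretized variables, e.g. quantized part locations."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def chow_liu_tree(samples, bins=4):
    """Maximum spanning tree over pairwise mutual information
    (Prim's algorithm). `samples` is an (n_samples, n_parts) array of
    discretized part locations; returns the list of tree edges."""
    n_vars = samples.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i, j in combinations(range(n_vars), 2):
        mi[i, j] = mi[j, i] = mutual_information(samples[:, i], samples[:, j], bins)
    in_tree = {0}
    edges = []
    while len(in_tree) < n_vars:
        # grow the tree along the highest-MI edge leaving it
        i, j = max(((a, b) for a in in_tree
                    for b in range(n_vars) if b not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy usage: a chain of three "parts" where each location depends on the
# previous one; the recovered tree should follow the chain 0-1-2.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 4, 2000)
x1 = (x0 + rng.integers(0, 2, 2000)) % 4
x2 = (x1 + rng.integers(0, 2, 2000)) % 4
edges = chow_liu_tree(np.column_stack([x0, x1, x2]))
```

Once the tree structure is fixed, inference over it is exact and efficient (belief propagation on a tree), which is what makes such minimal structures attractive for pose estimation.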

Short Bio:

Dr. Yi Li received his Ph.D. from the ECE Dept. at the University of Maryland, College Park in 2011. His Ph.D. research, entitled “Cognitive Robots for Social Intelligence”, focused on visual navigation for mobile robots, optical motion capture, causal inference for coordinated groups, and action recognition and representation. He was a recipient of the Future Faculty Fellowship at Maryland from 2008 to 2010, received the Best Student Paper award at ICHFR, and won second prize in the Semantic Robot Vision Challenge (SRVC). He joined NICTA as a Researcher in 2011 in the Visual Processing for Bionic Eye (VIBE) project, where he developed algorithms for visualizing critical information (US/AU patents pending). His recent research interests include human pose estimation, higher-order loss functions in machine learning, and image deblurring via sparse signal processing.