Monitoring Complex Manipulation Activities

Members: Cornelia Fermüller, Andreas Andreou, Antonis Argyros, Yiannis Aloimonos, Bert Shi, Yi Li, Francisco Barranco, Michael Pfeiffer, Ching Lik Teo, Yezhou Yang, Fang Wang, Aleksandrs Ecins, Austin Myers, Thomas Murray

The goal of this project was to demonstrate a complete system that interprets a complex manipulation activity performed by a human using visual input. Specifically, given the description of an activity, our goal was to develop a vision-based system that monitors a human performing this activity and gives warnings if the activity is performed incorrectly. The task involves developing a variety of components, including the interpretation of hand motion, body motion, object recognition, and checking of correct object assembly. We developed novel modules for each of these tasks. However, the emphasis of this project was on showing how to put the different components together into one system.

The specific task was as follows: The system is given descriptions of a number of complex tasks in the carpentry domain, such as how to make a picture frame, how to make a coat hanger, or how to make a book stop. We describe a task as a sequence of actions. For example, “making a picture frame” consists of six actions: ‘Mark the plank’, ‘Prepare the plank for cutting’, ‘Cut the plank’, ‘Align the pieces in a right angle’, ‘Attach a screw’, ‘Attach a nail’. Each action is described as a sequence of steps, and each step is described by a number of observations: the grasp of the right and left hand, tools in the right and left hand, objects in the right and left hand, and the action of the upper body. Separating the sequence into steps requires a dynamic segmentation. We define a transition between steps to occur when the hand comes in contact with an object, when the hand releases an object, when an object comes in contact with another object, or when the action changes.
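The task/action/step hierarchy above can be sketched as a small data structure. The following is a minimal illustration, not the system's actual representation; the field names and example observation values are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Transition(Enum):
    # The four events that end a step, as defined in the text
    HAND_CONTACTS_OBJECT = auto()
    HAND_RELEASES_OBJECT = auto()
    OBJECT_CONTACTS_OBJECT = auto()
    ACTION_CHANGES = auto()

@dataclass
class Step:
    # Per-step observations: grasp, tool, and object for each hand,
    # plus the upper-body action (illustrative values, not from the system)
    grasp: dict        # e.g. {"left": "power", "right": "precision"}
    tool: dict         # e.g. {"left": None, "right": "saw"}
    obj: dict          # e.g. {"left": "plank", "right": None}
    body_action: str   # e.g. "cutting"

@dataclass
class Action:
    name: str
    steps: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    actions: list = field(default_factory=list)

# The "making a picture frame" task as a sequence of its six actions
frame = Task("making a picture frame", [
    Action("Mark the plank"),
    Action("Prepare the plank for cutting"),
    Action("Cut the plank"),
    Action("Align the pieces in a right angle"),
    Action("Attach a screw"),
    Action("Attach a nail"),
])
```

Each `Action` would be filled with its `Step` sequence, segmented at the transition events listed in `Transition`.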

To monitor the activity, recognize the activity, and identify at any point in time the current state of the activity, a hidden Markov model (HMM) is used. The input to the HMM consists of the observations of the grasp, objects, and actions in every frame.
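The state tracking can be illustrated with a toy forward (filtering) recursion over an HMM whose hidden states are the steps of the activity. The transition and emission matrices below are invented for the sketch, and the per-frame observations are collapsed to a single discrete symbol; the actual system's model and parameters are not shown here.

```python
import numpy as np

# Toy HMM: 3 hidden states (steps of the activity), 4 observation symbols.
# A left-to-right transition matrix: a step persists or advances to the next.
A = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
# Emission matrix: P(observation symbol | step), purely illustrative
B = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1]])
pi = np.array([1.0, 0.0, 0.0])  # the activity starts in step 0

def forward_step(belief, obs):
    """One filtering update: predict with A, correct with B[:, obs]."""
    b = (belief @ A) * B[:, obs]
    return b / b.sum()

# Feed in a short per-frame observation sequence
belief = pi
for obs in [0, 0, 1, 1, 2]:
    belief = forward_step(belief, obs)

# The most likely current step of the activity
current_step = int(np.argmax(belief))
```

At each frame the belief over steps is updated, so the monitor always has an estimate of where in the activity the human currently is, and an observation with low likelihood under the expected step can trigger a warning.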

As sensors we used two Kinects (RGB-D sensors). The different processes use either the images alone or the combined images and depth maps. The process computing the hand grasp uses a 3D model-based hand tracker, based on images and depth maps, to estimate the pose of the hand at every instant of time. The dimensionality of the parameters describing the full-hand pose is then reduced, and in this reduced space the hand pose is classified into one of four categories. Action recognition is realized by estimating the pose of the upper body from images only. The pose is then classified as belonging to one of five action categories. Object recognition uses images only and is based on contours. Objects on the table are recognized at the beginning of each action. The hand-tracking module passes a message to the object recognition module whenever the hand grasps an object. Objects in the hand and close by are then tracked, and their identities are computed at regular time intervals. For each action, we also check whether it has been performed correctly. The plank is recognized from 3D data, and its position and shape are estimated. This information is used to perform a number of specific checks, such as whether the two cut plank pieces are of the same length, whether they are orthogonal, etc.
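The grasp-classification stage (dimensionality reduction of the full-hand pose parameters followed by classification into four categories) can be sketched as follows. The pose dimensionality, the synthetic data, and the nearest-centroid classifier are all assumptions standing in for the tracker output and the system's actual reduction and classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for full-hand pose parameters from the 3D model-based tracker
# (dimension chosen arbitrarily). Four synthetic clusters play the role
# of the four grasp categories.
n_per_class, dim = 50, 26
centers = rng.normal(0, 5, size=(4, dim))
X = np.vstack([c + rng.normal(0, 0.5, size=(n_per_class, dim))
               for c in centers])
y = np.repeat(np.arange(4), n_per_class)

# Reduce the pose parameters via PCA (SVD on the centered data)
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:3].T              # keep the top 3 principal components
Z = (X - mean) @ W        # poses in the reduced space

# Classify a pose in the reduced space by its nearest class centroid
centroids = np.array([Z[y == k].mean(axis=0) for k in range(4)])

def classify(pose):
    z = (pose - mean) @ W
    return int(np.argmin(np.linalg.norm(centroids - z, axis=1)))

acc = np.mean([classify(x) == t for x, t in zip(X, y)])
```

The same reduce-then-classify pattern applies to the upper-body pose, which is mapped to one of five action categories.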

Technological impact

An immediate application of these ideas would be an intelligent system that assists with manufacturing. The system is given the description of the manufacturing task to be performed, either in the form of a script or through demonstrations by an expert. It then monitors the worker to check whether the worker is safe and performing the task correctly, and warns the worker of any deviation from the task description. In general, being able to interpret humans performing actions in natural environments using vision will be useful in many applications of cognitive robotics. For example, robots assisting humans with daily activities, such as household chores, or robots assisting the elderly or sick, need to be able to interpret the actions humans perform.

Intellectual contributions

The contributions are along two lines. First, through the individual modules we address issues related to the action-perception loop. The individual modules study and implement visual representations related to action understanding. For example, we utilize, for the first time, an accurate hand tracker, and we use the grasp of the hand as a cue for describing the interaction between actions and objects. New pose models capturing the relative position of the different parts of the body are used for action description. Novel contour representations utilizing mid-level operators are used for object recognition. The velocity and trajectory of hand motion are used to predict which object the hand is moving toward. For the first time, we attempt to model correct assembly and error making in assembly. We also discussed and started work on exploiting, for the purpose of recognition, the relationship of the shape and form of objects to their affordances, and on how the concept of affordance can be used to learn to segment objects into their functional parts. Second, we study how to represent complex activities as a sequence of simple actions. Because of the large variation in the way manipulation actions can be performed, and the possibly large variation in visual appearance, it is essential to have a good way of segmenting the sequence of observations in time. We propose a description that combines discrete, symbolic information with the observed visual signals and represents activities at multiple levels of abstraction.

The process of monitoring an action here parallels what happens when humans observe manipulation activities. First, attention is drawn to the agent (the hand). Then attention is drawn to the object the hand is moving toward (object recognition and tracking). Then attention returns to the hand/agent, monitoring the movement (action recognition). Finally, it is checked whether the grasp and action were successful (goal satisfaction).