Manipulation Actions: Movements, Forces, and Affordances

Members: Arindam Basu, Ashley Kleinhans, Andreas Andreou, Brandon Carroll, Diederik Paul Moeys, David Reverter Valeiras, Bert Shi, Eric Hunsberger, Francisco Barranco, Garrick Orchard, Paul Isaacs, Jonathan Tapson, Jie Jack Zhang, Kate Fischl, Kan Li, Konstantinos Zampogiannis, Nobuhiro Hagura, Ernst Niebur, Michael Pfeiffer, Ralph Etienne-Cummings, Shih-Chii Liu, Soumyajit Mandal, Timmer Horiuchi, Yezhou Yang, Zhaokang Chen, Zonghua Gu

Organizers: Cornelia Fermuller (U Maryland), Michael Pfeiffer (Univ. of Zurich), Ryad Benjamin Benosman (UPMC, Institut de la Vision), Andreas Andreou (Johns Hopkins)

Focus and goals of this topic area

This topic area is centered on manipulation actions, in particular hand movements and the forces applied to the hands. We will also explore how actions can be recognized from vision and from force measurements, and how these actions and forces relate to the manipulation of objects. Sub-projects will be structured around four possible areas.

1. Event-based Vision for Action Recognition: Using the DVS and ATIS cameras in combination with conventional cameras, we will explore motion signatures of manipulation actions. This amounts to applying learning approaches that relate the visual data to the recognition of specific manipulation actions.

[Figure: ATIS camera image]
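As a starting point, the event streams could be converted into simple frame-like motion signatures and fed to a standard classifier. The sketch below is a minimal Python illustration of this idea; the event-array layout (x, y, timestamp, polarity), the sensor resolution, and the window length are assumptions, not a fixed interface of any particular camera driver.

```python
import numpy as np
from sklearn.svm import LinearSVC

SENSOR_SHAPE = (240, 180)   # assumed DAVIS-like resolution; adjust per sensor
WINDOW_US = 50_000          # assumed 50 ms accumulation window

def event_count_frames(events, window_us=WINDOW_US, shape=SENSOR_SHAPE):
    """Accumulate (x, y, t, polarity) events into per-window count frames.

    `events` is assumed to be an (N, 4) array with timestamps in microseconds.
    Each frame counts ON minus OFF events per pixel, a crude motion signature.
    """
    t0 = events[:, 2].min()
    bins = ((events[:, 2] - t0) // window_us).astype(int)
    frames = np.zeros((bins.max() + 1, *shape), dtype=np.float32)
    signs = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(frames, (bins, events[:, 0].astype(int), events[:, 1].astype(int)), signs)
    return frames

def signature(frames):
    """Flatten a clip of count frames into one feature vector (mean + std per pixel)."""
    return np.concatenate([frames.mean(axis=0).ravel(), frames.std(axis=0).ravel()])

# Hypothetical usage: `clips` is a list of event arrays, `labels` the action classes.
# features = np.stack([signature(event_count_frames(c)) for c in clips])
# clf = LinearSVC().fit(features, labels)
```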

2. Relating hand movements, forces, and actions: We will bring a data glove to collect data on the movement of the hand and fingers during different actions, and another glove (a device we will build) to collect force measurements on different parts of the hand. We also have Kinect cameras and software to track human body motion (skeletal tracker) and hand motion (hand trackers). Some project ideas include:

• By observing people performing actions with cameras, we would like to predict the intention of a person's action, and the outcome of the action, early on. This capability to predict is essential if we want to create real-time vision systems for applications of human-robot collaboration. According to the psychological literature, three components are involved in understanding others' action intentions: the context, the hand grasp at contact, and the kinematics of the hand/arm (see attached paper: Ansuini et al., 2015). Along these lines, we would like to develop methods that predict the intention of similar actions, or classify actions and predict the trajectory/location (a simple trajectory-prefix baseline is sketched after this list).

• Another project will focus on segmenting the observed action sequence into individual action segments in time. We observe, with a Kinect camera, a person at a bar mixing a drink. The person mixes different juices to create his/her drink, and can intentionally try to fool the system by performing movements that are not related to the drink mixing. We would like to recognize the whole activity of the person, and separate the action segments that contribute to the drink mixing from the others.

• Relating the measurements of trajectories and forces to objects, specifically to fine-grained descriptions of objects (their attributes), in order to recognize these attributes. For example, the amount of force one applies to a knife while cutting, and the speed with which the hand moves, are related to how soft or hard the manipulated object (here, the object being cut) is. Similarly, the forces and movements also encode information about the shape and size of an object.

• Using force measurements in addition to visual observations for action learning. The idea is that during training we have available, in addition to the visual data, data from the force measurements, whereas during inference (testing) only the visual data is available. We expect that having both kinds of data (if the two modalities are partially independent) allows us to learn a better classifier. We can try different learning mechanisms, such as learning using privileged information (see attached paper: Vapnik, 2015); a minimal sketch follows below.
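One lightweight way to exploit the force channel at training time only, without implementing the full SVM+ machinery of the privileged-information framework, is distillation: a teacher model sees visual plus force features, and a student that sees only visual features is trained to reproduce the teacher's outputs. The sketch below is an illustrative stand-in for the cited method; the feature names and dimensions are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_with_privileged_forces(X_vis, X_force, y):
    """Teacher uses vision + force; the student is trained on vision only,
    weighted by the teacher's confidence, and needs only vision at test time."""
    X_both = np.hstack([X_vis, X_force])
    teacher = LogisticRegression(max_iter=1000).fit(X_both, y)
    soft = teacher.predict_proba(X_both)               # teacher's soft outputs
    teacher_labels = teacher.classes_[soft.argmax(axis=1)]
    confidence = soft.max(axis=1)                      # weight by teacher certainty
    student = LogisticRegression(max_iter=1000).fit(
        X_vis, teacher_labels, sample_weight=confidence)
    return student   # call student.predict(X_vis_test) with vision features only

# Hypothetical shapes: X_vis (n, d_vis) visual features, X_force (n, d_force)
# glove readings aligned to the same action instances, y the action labels.
```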

[Figure: data glove]
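For the early-prediction idea in the first bullet above, a simple baseline is to train classifiers on progressively longer prefixes of the tracked hand trajectory and measure how accuracy grows with the fraction of the action observed. The sketch below assumes trajectories are (T, 3) arrays of hand positions from the Kinect hand tracker; the feature choice is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def prefix_features(traj, fraction):
    """Features from the first `fraction` of a (T, 3) hand trajectory:
    net displacement (3), path length, and mean speed of the observed prefix."""
    prefix = traj[: max(2, int(len(traj) * fraction))]
    steps = np.diff(prefix, axis=0)
    step_norms = np.linalg.norm(steps, axis=1)
    return np.array([*(prefix[-1] - prefix[0]), step_norms.sum(), step_norms.mean()])

def early_prediction_curve(trajectories, labels, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Cross-validated accuracy as a function of how much of the action is seen."""
    scores = {}
    for f in fractions:
        X = np.stack([prefix_features(t, f) for t in trajectories])
        scores[f] = cross_val_score(RandomForestClassifier(), X, labels, cv=5).mean()
    return scores
```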

3. Neural representations of actions and sensory-motor conflicts: Another avenue is to relate brain and muscle activity measurements to the observations of actions. This would be a collaboration with the Human Auditory Cognition group, and we will also invite EEG experts ourselves. Specifically, we will explore how action trajectories and force measurements for specific manipulation actions are encoded in EEG and EMG data, and how to decode actions from brain and muscle signals. Neural network dynamics for action recognition in sub-project 1, e.g. using deep or reservoir computing models, will generate signature trajectories of actions that can, for example, be related to brain and muscle activity.
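As a baseline for decoding actions from EEG/EMG, band-power features from band-pass-filtered epochs fed to a linear classifier are a common starting point. The sampling rate and frequency band below are placeholder assumptions; this is a sketch, not a fixed analysis pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

FS = 1000.0        # assumed sampling rate (Hz)
BAND = (8.0, 30.0) # assumed mu/beta band for movement-related activity

def bandpower_features(epochs, fs=FS, band=BAND):
    """Log band power per channel for epochs of shape (n_epochs, n_channels, n_samples)."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=-1)
    return np.log(np.mean(filtered ** 2, axis=-1) + 1e-12)

# Hypothetical usage: `epochs` are time-locked to action onset, `labels` the action class.
# X = bandpower_features(epochs)
# acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean()
```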

We will also utilize data goggles and EMG to investigate the interplay of movement, proprioception, and vision. The data goggles will be used to induce sensory-motor conflicts, and we will compare behavior and muscle activity in coherent and incoherent scenarios.

A possible project involves looking into the interaction of vision and sound, and possibly touch, with respect to materials. There is a strong audiovisual interaction with materials, similar to the McGurk effect. For example, if one sees glass and hears paprika, one perceives plastic; or if one sees bark and hears metal, one perceives ceramic (see abstract below: Not glass but plastic). This could tell us something about how the different modalities (vision and sound) organize materials, how they are combined, and how we learn. A possible project is to look at the EEG measurements of a person perceiving one or both modalities, and to apply learning algorithms to model this effect (a small cue-combination sketch follows the figure below).

[Figure: data goggles + EMG]
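One simple way to model the audiovisual material interaction described above is probabilistic cue combination: the perceived material maximizes the product of the visual and auditory likelihoods. The material set and the likelihood numbers below are made up purely for illustration.

```python
import numpy as np

MATERIALS = ["glass", "plastic", "ceramic", "metal", "bark", "paper"]  # illustrative set

def combine(p_vision, p_audio, prior=None):
    """Posterior over materials from per-modality likelihoods, assuming the two
    cues are conditionally independent given the material."""
    prior = np.ones(len(p_vision)) if prior is None else np.asarray(prior)
    post = np.asarray(p_vision) * np.asarray(p_audio) * prior
    return post / post.sum()

# Illustrative numbers only: vision suggests glass, audio suggests a duller material,
# and the combined percept lands on a third category.
p_vision = np.array([0.55, 0.25, 0.10, 0.05, 0.03, 0.02])
p_audio  = np.array([0.05, 0.40, 0.20, 0.05, 0.10, 0.20])
posterior = combine(p_vision, p_audio)
print(MATERIALS[int(posterior.argmax())])   # -> "plastic" with these numbers
```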

4. Bio-inspired Human Action Recognition from Low-Dimensional Time-Series Sensor Data: Representation Learning and Inference Using Generative Probabilistic Graphical Models: The first step towards engineering machines capable of smoothly interacting with human beings is the development of sensory processing systems that can operate in natural environments and recognize human actions. To do this reliably, we must engineer solutions that robustly handle the wide range of complexity exhibited by actions and the contextual variability resulting from both the objects involved in the action and the vantage point of the observer. To overcome these challenges, biology has evolved several mechanisms that we have incorporated into a novel sensory data acquisition platform.

The first is to combine information from multiple sensory modalities. The type of information easily extracted from each modality is best suited to solving a particular set of sub-tasks towards an overall system solution. The second is temporal coherence, which has been shown to be a primary cue for binding related sensory data into a cohesive unit both within and between various sensor modalities. Finally, we employ arrays of similar sensors in order to deal with the geometrical complexities of three dimensional environments.

One of the primary scientific questions addressed by this project, and one central to pattern analysis and machine intelligence, is whether having a symbolic model of the physical objects of interest is sufficient to bootstrap low-dimensional, impoverished sensory data, allowing machines to make inferences about complex, higher-dimensional structures in the natural environment. Building on the capabilities of our multimodal sensory acquisition system, we use a methodology for learning the hierarchical structure of human actions. The structure of action is in many ways analogous to the structure of human languages, suggesting that the generative approaches widely used by automatic speech recognition systems can be adapted to our action learning task. In this instance the low-dimensional time-series data are active acoustics from a micro-Doppler sensor, which include little or no spatial information, and the high-dimensional data are RGB-Depth skeleton data from a Microsoft Kinect sensor.

The task is that of human action recognition from the active acoustic data. To accomplish this, a dictionary of human actions and symbolic representations of skeletal poses is learned from the high-dimensional Kinect data. Complementary to this, the rich temporal structure of the micro-Doppler modulations is learned with generative models (HMMs) that are linked to the dictionary of actions. At runtime, the model then relies purely on the low-dimensional data (micro-Doppler active acoustics) to infer the human action, without using any vision data. We will extend the algorithms developed for the sonar micro-Doppler to the low-dimensional time-series data collected with event-based sensors such as the ATIS, DVS, and DAVIS, and we will participate in data collection and processing of object manipulations in a kitchen-scene context.
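The generative approach described here can be prototyped with one hidden Markov model per action: each HMM is fit to the micro-Doppler feature sequences of its action, and at runtime the action whose model assigns the highest log-likelihood wins. The sketch below uses the hmmlearn package and assumes each sequence is already reduced to a (T, d) array of spectral features; it is a minimal stand-in for the full dictionary-linked models described above.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def fit_action_hmms(sequences_by_action, n_states=5):
    """Fit one Gaussian HMM per action label.

    `sequences_by_action` maps an action name to a list of (T, d) feature
    arrays (e.g. micro-Doppler spectrogram slices)."""
    models = {}
    for action, seqs in sequences_by_action.items():
        X = np.vstack(seqs)                 # concatenated observations
        lengths = [len(s) for s in seqs]    # per-sequence lengths for training
        models[action] = GaussianHMM(n_components=n_states,
                                     covariance_type="diag",
                                     n_iter=100).fit(X, lengths)
    return models

def classify(models, seq):
    """Pick the action whose HMM assigns the highest log-likelihood to `seq`."""
    return max(models, key=lambda a: models[a].score(seq))
```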

Faculty of the Topic Area

Name | Institution | Expertise | Time | Website
Cornelia Fermuller | University of Maryland | Computer Vision, Human Vision | 28 June - 18 July | WWW
Michael Pfeiffer | Institute of Neuroinformatics (UZH/ETH Zurich) | Machine Learning, Computational Neuroscience | 28 June - 18 July | WWW
Ryad Benjamin Benosman | Institut de la Vision (UPMC Paris) | Event-based Vision, EEG | 28 June - 18 July | WWW
Andreas Andreou | Johns Hopkins University | Bio-inspired Pattern Analysis and Machine Intelligence | 28 June - 18 July | WWW
Arko Ghosh | Institute of Neuroinformatics (UZH/ETH Zurich) | Sensations, Movement, Brain Plasticity | 4 July - 9 July | WWW
Nobuhiro Hagura | Kyoto University | Psychophysics, Cognitive Science | 4 July - 9 July |
Bert Shi | Hong Kong Univ. of Science and Technology | Neuromorphic Engineering, Machine Vision | | WWW
Fang Wang | NICTA, Canberra | Computer Vision | 29 June - 18 July | WWW
Yezhou Yang | University of Maryland | Computer Vision, Robotics | 28 June - 11 July | WWW
Yiannis Aloimonos | University of Maryland | Computer Vision, Robotics | 11 July - 18 July | WWW



Result Page

Lectures, Tutorials, and Slides

  • 29 June: MFA field tutorial - TBA
  • 1 July: MFA tutorial - TBA
  • 9 July: Action Recognition without a Camera, A.G. Andreou Lecture

Available Hardware and Equipment

  • Conventional cameras
  • Silicon Retinas (ATIS, DVS, DAVIS)
  • Kinect
  • Data glove for movement and force measurements
  • Epson Moverio BT-200 Data Goggles
  • EMG sensors for muscle activity
  • Tactile stimulator
  • EEG (TBD)
  • Three micro-Doppler sonar units (28 kHz, 32 kHz, and 40 kHz) with wireless beacon for simultaneous data acquisition
  • Kinect sensor capable of wireless synchronization with micro-Doppler units.

Recommended Reading and Resources