A cognitive robot detecting objects using sound, language, and vision

Members: Aleksandrs Ecins, Adam McLeod, Ching Teo, Daniel B. Fasnacht, Eirini Balta, Francisco Barranco, Cornelia Fermuller, John Harris, Kailash Patil, Malcolm Slaney, Mounya Elhilali, Michael Pfeiffer, Ryad Benjamin Benosman, Shih-Chii Liu, Tomas Figliolia, Timmer Horiuchi, Tobi Delbruck, Troy Lau, Yezhou Yang

- Organized by Cornelia Fermuller, Yiannis Aloimonos, & Andreas Andreou

Name                     Email                  Arrival  Departure
Cornelia Fermuller       cornelia.fermuller@…   26-Jun   16-Jul
Yiannis Aloimonos        yiannis@…              26-Jun   16-Jul
Andreas Andreou          andreou@…              29-Jun   18-Jul
Ryad Benjamin Benosman   benjry.benos@…         25-Jun   16-Jul
Katerina Pastra          kpastra@…              26-Jun   16-Jul
Eirini Balta             ebalta@…               26-Jun   16-Jul
Hui Ji                   matjh@…                5-Jul    16-Jul
Ajay Mishra              mishraka@…             1-Jul    8-Jul
Douglas Summerstay       dsummerstay@…          30-Jun   9-Jul
Austin Meyers            amyers@…               26-Jun   3-Jul

Related tutorial: Please go to 2011/ros11 to download the ROS-related software.

Problem description:

We propose to study the interaction between sound, high-level knowledge (in the form of language), and visual processes for solving the problem of object recognition for an embodied system. We envision a system that has the same major cognitive components that humans use to solve this problem: (1) speech understanding; (2) a high-level cognitive system (in the form of language) that can reason about object properties; (3) a vision system that segments the image regions corresponding to objects and extracts visual properties of these regions based on 2D appearance and shape attributes; (4) an attention mechanism that uses information from language and vision to decide where in the image/video to focus next and what information to extract; and (5) a memory structure organizing object knowledge.
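As a rough illustration of how these five components might interact at run time, the sketch below wires them into a single search loop. All component objects and method names (speech.understand, attention.next_fixation, and so on) are hypothetical placeholders, not an existing API; the sketch only shows the intended flow of information, not an implementation.

```python
def find_object(spoken_request, speech, language, attention, vision, memory, robot):
    """Hypothetical top-level loop over the five cognitive components.
    Every argument is assumed to expose the (made-up) methods used below."""
    target = speech.understand(spoken_request)            # 1. speech understanding
    hints = language.properties_of(target)                # 2. language-based reasoning about object properties
    while not robot.done_searching():
        view = robot.capture()                            # current image / point cloud
        fixation = attention.next_fixation(hints, view)   # 4. attention: where to look next
        region = vision.segment_at(view, fixation)        # 3. segmentation at the fixation point
        description = vision.describe(region)             #    2D appearance and shape attributes
        if memory.matches(target, description):           # 5. memory: compare with stored object knowledge
            return region                                 # report the found object
    return None
```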

To demonstrate these ideas, we would like to combine the different components into one system and solve the following problem: a robot is given, in spoken language, the names of objects and has to find these objects in a room using its vision system. We plan to bring our robot, which is equipped with a laser range sensor, a sonar ring, and a pan-tilt unit carrying a stereo rig of four colour cameras. The robot already has software for basic navigation (obstacle avoidance, path planning) and for building a map of the environment.

Relationship to previous work:

While visual object recognition is a heavily studied problem in Computer Vision, the current framework for this research is not anthropomorphic but database-driven. Current object recognition approaches, without segmenting the scene into regions corresponding to objects, passively search the image with templates of appearance-based feature descriptors; their success is largely due to advances in learning techniques. A few recent studies have considered additional information for recognition from labeled images, transcripts, and language resources, but they treated language simply as a contextual system. In contrast, we would like to study the interaction between signal processing (vision and sound) and higher cognitive processes (language processing) and implement them in a system with an attention mechanism, following an active approach.

List of specific topic area projects:

Speech processing: to understand instructions about objects

Natural Language Processing: Developing the tools to extract properties of objects to aid the visual processes. Such properties are visual attributes (color, texture, shape), object part descriptions, and information about the spatial relationships of objects in the scene.

Visual processing: Segmentation of the scene into regions corresponding to objects, and computation of 2D properties such as texture and contours, as well as 3D shape primitives.

Attention system: Developing a framework (possibly using information theory) for deciding where to look next on the basis of higher-level knowledge together with visual information (a toy sketch of such a criterion follows this list).

Memory: Studying which primitives of shape and 2D visual appearance characterize specific objects, and how we can organize this knowledge in a principled fashion.
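To make the "where to look next" criterion in the attention item concrete, the toy sketch below selects the fixation that maximizes the expected reduction of uncertainty (expected information gain) about where the target object is, over a discretized image grid. The detector model with fixed hit and false-alarm rates and the grid prior are assumptions made for illustration; they are not part of the project description.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a per-cell Bernoulli belief 'the target is here'."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def expected_information_gain(belief, hit_rate=0.9, false_alarm=0.2):
    """Expected entropy reduction from fixating each cell, assuming a binary
    detector with the given hit and false-alarm rates (illustrative values)."""
    p = belief
    p_pos = hit_rate * p + false_alarm * (1 - p)           # P(detector fires at this cell)
    post_pos = hit_rate * p / p_pos                         # posterior if it fires
    post_neg = (1 - hit_rate) * p / (1 - p_pos)             # posterior if it stays silent
    return entropy(p) - (p_pos * entropy(post_pos) + (1 - p_pos) * entropy(post_neg))

# toy prior on a 4x4 grid: language suggests the object is probably in the top row
belief = np.full((4, 4), 0.05)
belief[0, :] = 0.4
best = np.unravel_index(np.argmax(expected_information_gain(belief)), belief.shape)
print("fixate cell", best)
```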



In this project we plan to put together a whole system, consisting of a robot with vision, sound, and language (for reasoning), that recognizes human manipulation activities. A manipulation activity in this description consists of three quantities: the tool, the action, and the object, for example “knife, cut, tomato”. The robot, with software developed under ROS, looks at a scene in which a person performs a manipulation action and outputs a verbal description of the activity, such as: “A person cuts a tomato with a knife.”
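The triple representation and the target sentence format can be pinned down in a few lines; the snippet below is only meant to fix that representation, and the naive "+s" verb inflection works only for regular verbs such as "cut" in this example.

```python
from collections import namedtuple

# A manipulation activity as the triple described above; field names are illustrative.
Activity = namedtuple("Activity", ["tool", "action", "obj"])

def verbalise(act):
    """Produce the kind of sentence the system should output (naive verb inflection)."""
    return f"A person {act.action}s a {act.obj} with a {act.tool}."

print(verbalise(Activity(tool="knife", action="cut", obj="tomato")))
# -> A person cuts a tomato with a knife.
```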

The different components that will be developed:

1. Enriching the Praxicon:

The Praxicon is a lexical resource that encodes information about the relationships between actions and objects, descriptions of actions, and descriptions of objects. It will be enhanced with information relating to the manipulation actions analyzed in this project.
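The Praxicon's real schema is not reproduced here; the made-up entries below only illustrate the kind of action-tool-object relations and object descriptions that the enriched resource is meant to supply to the rest of the system.

```python
# Made-up example entries, for illustration only; the actual Praxicon content differs.
praxicon = {
    "cut":    {"tools": ["knife", "butter knife", "slicer"], "objects": ["tomato", "bread", "cucumber"]},
    "spread": {"tools": ["butter knife"],                    "objects": ["butter", "jam"]},
}
object_descriptions = {
    "tomato": {"color": "red", "shape": "round",     "parts": ["skin", "stem"]},
    "knife":  {"color": None,  "shape": "elongated", "parts": ["blade", "handle"]},
}

def combination_is_known(tool, action, obj):
    """True if the (tool, action, object) triple is consistent with the stored relations."""
    entry = praxicon.get(action, {})
    return tool in entry.get("tools", []) and obj in entry.get("objects", [])
```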

2. The reasoner:

The reasoner receives as input the quantities recognized by the visual modules (tool, object, action) and verifies whether the combination is possible; if not, it flags which quantity was likely recognized erroneously and proposes alternatives. It also generates the sentence describing the scene. The output has the following general form; two concrete examples and a small code sketch follow:
Schema:

    visual: ok
    alternative type: objA probability, objB probability, objC probability, etc.
    verbalisation: 'sentence describing the scene'

Example (combination accepted):

    visual: ok
    alternative type: none
    verbalisation: 'cut the tomato with the knife'

Example (tool rejected):

    visual: wrong
    alternative tool: butter knife 1, slicer 0.5, etc.
    verbalisation: none
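A minimal sketch of a reasoner emitting records in exactly this format is given below; the hard-coded set of consistent triples stands in for the real Praxicon query, and the alternative-tool scores are made-up examples.

```python
def reason(tool, action, obj, alternative_tool_scores):
    """Return a record in the output format above. alternative_tool_scores maps
    candidate tools to confidences supplied by the vision modules (made up here)."""
    consistent = {("knife", "cut", "tomato"), ("butter knife", "spread", "butter")}  # placeholder
    if (tool, action, obj) in consistent:
        return {"visual": "ok",
                "alternative tool": None,
                "verbalisation": f"{action} the {obj} with the {tool}"}
    return {"visual": "wrong",
            "alternative tool": sorted(alternative_tool_scores.items(), key=lambda kv: -kv[1]),
            "verbalisation": None}

print(reason("knife", "cut", "tomato", {}))
print(reason("spoon", "cut", "tomato", {"butter knife": 1.0, "slicer": 0.5}))
```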

3. Communication between the robot and the Praxicon through a web-service
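Client-side, the exchange could look like the sketch below; the endpoint URL and the JSON field names are placeholders, since the actual web-service interface is not specified here.

```python
import requests

PRAXICON_URL = "http://example.org/praxicon/query"   # placeholder address, not the real service

def query_praxicon(tool, action, obj, timeout=5.0):
    """Send a recognized (tool, action, object) triple to the Praxicon web-service
    and return its reply (expected to carry the visual / alternative / verbalisation fields)."""
    reply = requests.post(PRAXICON_URL,
                          json={"tool": tool, "action": action, "object": obj},
                          timeout=timeout)
    reply.raise_for_status()
    return reply.json()
```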

4. Object and tool segmentation

To segment objects and tools, we use the fixation-based algorithm of Mishra et al. (2009), but with Kinect (RGB-D) data as input, and adapt it to run under ROS.
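On the ROS side, feeding the segmentation with registered RGB-D data could look like the node sketched below; the topic names are the usual openni_camera defaults and may differ on the robot, and the segmentation call itself is left as a stub for the Mishra et al. algorithm.

```python
import rospy
from sensor_msgs.msg import Image, PointCloud2

class FixationSegmentationNode:
    """Collects RGB and point-cloud messages to hand to the fixation-based segmenter."""

    def __init__(self):
        self.latest_cloud = None
        rospy.Subscriber("/camera/rgb/image_color", Image, self.on_image)                # assumed topic
        rospy.Subscriber("/camera/depth_registered/points", PointCloud2, self.on_cloud)  # assumed topic

    def on_cloud(self, msg):
        self.latest_cloud = msg

    def on_image(self, msg):
        if self.latest_cloud is None:
            return
        # placeholder: run the fixation-based segmentation on (msg, self.latest_cloud)
        # region = segment_at_fixation(msg, self.latest_cloud, fixation_point)
        pass

if __name__ == "__main__":
    rospy.init_node("fixation_segmentation")
    FixationSegmentationNode()
    rospy.spin()
```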

5. Hand detection and finding when the hand comes in contact with the tool

Hands will be located and tracked using an existing ROS package. Code needs to be developed to determine, in time, when the hand comes into contact with the tool and when the hand releases the tool.
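One simple way to localize contact and release events in time, sketched below, is to threshold the hand-tool distance with hysteresis; the distance thresholds are guesses, and the per-frame 3D positions are assumed to come from the hand tracker and the tool segmentation.

```python
import numpy as np

def contact_intervals(hand_xyz, tool_xyz, touch_dist=0.05, release_dist=0.10):
    """hand_xyz, tool_xyz: (T x 3) per-frame positions in meters.
    Returns (start_frame, end_frame) intervals during which the hand holds the tool.
    Two thresholds (hysteresis) avoid flickering around a single cut-off."""
    d = np.linalg.norm(np.asarray(hand_xyz) - np.asarray(tool_xyz), axis=1)
    intervals, start = [], None
    for t, dist in enumerate(d):
        if start is None and dist < touch_dist:
            start = t                          # hand comes into contact with the tool
        elif start is not None and dist > release_dist:
            intervals.append((start, t))       # hand releases the tool
            start = None
    if start is not None:
        intervals.append((start, len(d) - 1))
    return intervals
```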

6. Classifying grasp positions

Using as input point-cloud data of the hand, extracted with a ROS package, the grasp pose will be classified into a small set of categories. The resulting hand descriptor will be used as part of the action description.
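A possible baseline, sketched below, reduces the hand point cloud to a small global descriptor and classifies it with a nearest-neighbour rule; the descriptor, the classifier, and the two grasp categories are all illustrative choices, and the training clouds are random placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def hand_descriptor(points):
    """Global descriptor for an (N x 3) hand point cloud: covariance eigenvalues
    (spread of the points) plus bounding-box extent. A stand-in for the final choice."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]
    extent = pts.max(axis=0) - pts.min(axis=0)
    return np.concatenate([eigvals, extent])

# placeholder training data and grasp categories, for illustration only
train_clouds = [np.random.rand(200, 3), np.random.rand(200, 3) * [0.2, 0.2, 0.05]]
train_labels = ["power grasp", "precision grasp"]
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([hand_descriptor(c) for c in train_clouds], train_labels)
print(clf.predict([hand_descriptor(np.random.rand(200, 3))]))
```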

7. Action description

Using as input Kinect data, an OpenNI ROS package extracts a skeleton model of the moving human. Action descriptions will be developed using as input the motion trajectories of the parts of the human skeleton together with the hand pose descriptions.
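The sketch below turns a skeleton joint sequence into a handful of coarse motion features of the acting hand; the joint names follow the usual OpenNI tracker frames ("right_hand", "torso"), and the particular features are only an illustration of the kind of action description meant here.

```python
import numpy as np

def trajectory_features(joints, fps=30.0):
    """joints: dict mapping joint names to (T x 3) position arrays from the skeleton tracker."""
    wrist = np.asarray(joints["right_hand"]) - np.asarray(joints["torso"])   # torso-centered
    step = np.linalg.norm(np.diff(wrist, axis=0), axis=1)                    # per-frame motion
    speed = step * fps
    return {
        "mean_speed": float(speed.mean()),
        "peak_speed": float(speed.max()),
        "net_displacement": float(np.linalg.norm(wrist[-1] - wrist[0])),
        "path_length": float(step.sum()),
        "vertical_range": float(wrist[:, 2].max() - wrist[:, 2].min()),      # vertical axis choice is assumed
    }
```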

8. The action parser

A parser that operates on the visual input and breaks the video into visual primitives.
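One simple realization, sketched below, cuts the hand-speed profile into movement segments separated by pauses and treats those segments as primitives; the thresholds are arbitrary, and the real parser may use richer cues such as the hand-tool contact events from item 5.

```python
import numpy as np

def parse_into_primitives(speed, moving_thresh=0.05, min_len=5):
    """speed: per-frame hand speed. Returns (start, end) frame ranges of movement
    segments, treated here as the visual primitives. Thresholds would need tuning."""
    moving = np.asarray(speed) >= moving_thresh
    segments, start = [], None
    for t, m in enumerate(moving):
        if m and start is None:
            start = t
        elif not m and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments
```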

9. Object shape descriptions

A number of shape descriptors of objects and tools will be computed from the point clouds.
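The sketch below computes a few global shape cues from an object or tool point cloud; the specific descriptors (bounding-box extent, elongation, flatness) are illustrative picks, not the project's final set.

```python
import numpy as np

def shape_descriptor(points):
    """Global shape cues from an (N x 3) point cloud in meters."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    lam = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]   # lam1 >= lam2 >= lam3
    extent = pts.max(axis=0) - pts.min(axis=0)
    return {
        "bbox_extent": extent.tolist(),                 # rough object size
        "elongation": float(lam[0] / (lam[1] + 1e-9)),  # high for knives, low for tomatoes
        "flatness": float(lam[1] / (lam[2] + 1e-9)),    # high for plates or cutting boards
        "volume_proxy": float(np.prod(extent)),
    }
```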

10. Learning the mapping between language attributes and visual attributes

A mapping between visual attributes and language attributes will be learned. The language attributes are linked to objects and to the actions that use these objects, providing additional information that serves as feedback during learning.
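One straightforward way to learn this mapping, sketched below, trains one binary classifier per attribute word on the visual descriptors of regions whose (language-supplied) attribute sets are known; the classifier choice and the data layout are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_models(visual_features, attribute_sets, vocabulary):
    """visual_features: (N x D) descriptors of segmented regions.
    attribute_sets: for each region, the set of attribute words the language side assigns.
    Returns one classifier per attribute word that occurs with both positives and negatives."""
    X = np.asarray(visual_features)
    models = {}
    for word in vocabulary:
        y = np.array([word in attrs for attrs in attribute_sets], dtype=int)
        if 0 < y.sum() < len(y):                       # need examples of both classes
            models[word] = LogisticRegression(max_iter=1000).fit(X, y)
    return models
```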

11. Visual navigation

Using as input a map created with the laser sensor, odometry from the shaft encoders, and laser range data, a hippocampus model for localization will be developed.
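Purely as a rough illustration of the intended direction (and not of the model that will actually be developed), the sketch below stores "places" as map positions paired with their laser signatures, activates them in a place-cell-like fashion from the current scan, and fuses the resulting estimate with odometry.

```python
import numpy as np

class PlaceCellLocalizer:
    """Toy place-cell-style localizer: stored places = (map position, laser signature)."""

    def __init__(self, positions, scans, sigma=0.5):
        self.positions = np.asarray(positions, dtype=float)   # (P x 2) map coordinates
        self.scans = np.asarray(scans, dtype=float)           # (P x B) laser range signatures
        self.sigma = sigma                                     # tuning width (assumed units)

    def localize(self, scan, odometry_estimate, blend=0.5):
        err = np.linalg.norm(self.scans - np.asarray(scan, dtype=float), axis=1)
        activation = np.exp(-(err / self.sigma) ** 2)          # place-cell-like tuning curve
        activation /= activation.sum() + 1e-12
        scan_estimate = activation @ self.positions            # activation-weighted position
        return blend * scan_estimate + (1 - blend) * np.asarray(odometry_estimate, dtype=float)
```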

12. Acquiring distance functions in image and sound space

Using visual and sound recordings of actions, and given a classification in image space, a distance function in sound space will be learned so that the recordings can also be clustered in sound space.
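As one concrete reading of this item, the sketch below uses the image-space classification as labels for the recorded actions and learns a sound-space distance as the Euclidean distance after an LDA projection of the sound features; any other metric-learning method could be substituted, and the feature layout is an assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def learn_sound_metric(sound_features, image_space_labels):
    """sound_features: (N x D) sound descriptors of the recorded actions.
    image_space_labels: class of each recording obtained from the image-space clustering.
    Returns a distance function on sound descriptors that respects those classes."""
    lda = LinearDiscriminantAnalysis().fit(sound_features, image_space_labels)

    def distance(a, b):
        pa = lda.transform(np.asarray(a, dtype=float).reshape(1, -1))[0]
        pb = lda.transform(np.asarray(b, dtype=float).reshape(1, -1))[0]
        return float(np.linalg.norm(pa - pb))

    return distance
```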