Reasoning and Verbalisation

Members: Katerina Pastra, Eirini Balta


We have developed a language-based reasoner that implements the cognitive dialogue between the language executive and the visual and auditory executives in the visual scene understanding scenario. The reasoner takes as input one or more triplets of the form ‘tool-action-affected object’ and provides as output a verbal description of the visual scene analysed. Its task is to judge the cognitive plausibility of the information provided in the triplet(s), i.e. to check whether the action and objects recognised by the perceptual tools could plausibly form the constituents of a human action.

For example, consider the triplet ‘knife – slice – tomato’. The reasoner checks whether a slicing action could affect a ‘tomato’ and whether a ‘knife’ could play the role of its instrument. To do so, the reasoner makes use of the PRAXICON conceptual knowledge base (Pastra et al. 2011), which captures, among others, common sense knowledge on entities, movements and perceptual object/action features, and their interrelations. This information has been mined from text collections and lexical databases such as WordNet (Fellbaum 1998). The reasoner searches for a relation in the PRAXICON that links the three concepts in the triplet with ‘action-tool’ and ‘action-object’ relations, the ‘action’ being common in both cases. If such a relation is found, the perceived information in the triplet is considered cognitively plausible and the reasoner verbalises the description of the action in a sentence such as ‘someone slices a tomato with a knife’. If not, it proceeds to suggest alternatives as follows:
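The core plausibility check can be sketched as follows. This is an illustrative Python sketch, not the actual Java implementation; the relation store and its entries are toy stand-ins for the PRAXICON knowledge base and its API.

```python
# Hypothetical common-sense relations mined from text: each action maps to
# the tools that can perform it ('action-tool' relations) and the objects
# it can affect ('action-object' relations).
RELATIONS = {
    "slice": {"tools": {"knife"}, "objects": {"tomato", "cucumber", "bread"}},
    "pour":  {"tools": {"pitcher"}, "objects": {"water", "milk"}},
    "stir":  {"tools": {"spoon"}, "objects": {"coffee", "soup"}},
}

def is_plausible(tool, action, obj):
    """A triplet is plausible when one and the same action is linked to the
    tool via an 'action-tool' relation and to the object via an
    'action-object' relation."""
    rel = RELATIONS.get(action)
    return rel is not None and tool in rel["tools"] and obj in rel["objects"]

def verbalise(tool, action, obj):
    """Turn a plausible triplet into a sentence."""
    return f"someone {action}s a {obj} with a {tool}"

if is_plausible("knife", "slice", "tomato"):
    print(verbalise("knife", "slice", "tomato"))
    # -> someone slices a tomato with a knife
```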

(a) First, it checks whether the tool and the object are connected in these roles through one (or more) other action(s); if so, the reasoner suggests that one of these actions is more likely to be taking place in the scene. It then asks the perceptual tools to reconsider their action recognition results, getting more data from the scene through active perception. For example, for the input triplet ‘knife – pour – tomato’, the reasoner flags the implausible action and suggests that a ‘slicing’ action should be expected rather than a ‘pouring’ one.

(b) If the recognised tool and affected object cannot be linked in these roles through any action, the reasoner checks whether only one of them (the tool or the affected object) can be linked as such with the action provided in the triplet. If such a case is found, it suggests alternative objects or tools, respectively, that should be expected in the scene rather than the one provided in the triplet. For example, for the input triplet ‘knife – slice – pitcher’, the reasoner indicates that the tool and the action are cognitively plausible, whereas the affected object is not. In this case, it suggests a number of objects that could substitute for the affected object in the triplet, prompting the perceptual tools to reconsider their results.

(c) If neither of the entities provided in the triplet can be linked to the recognised action, the reasoner suggests both tools and affected objects that are commonly associated with that action. For example, for the input triplet ‘pitcher – slice – knife’, the reasoner takes the action for granted and suggests objects that could successfully substitute for the tool and affected object arguments of the triplet.
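The three-step fallback (a)–(c) can be sketched as a single cascade. Again, this is a hedged Python sketch over a toy relation store, not the actual implementation; the return format is an assumption made for illustration.

```python
# Toy stand-in for the PRAXICON 'action-tool' / 'action-object' relations.
RELATIONS = {
    "slice": {"tools": {"knife"}, "objects": {"tomato", "cucumber"}},
    "pour":  {"tools": {"pitcher"}, "objects": {"water", "milk"}},
}

def suggest(tool, action, obj):
    """Cascade of alternatives for an implausible triplet."""
    # (a) keep the tool and object; look for another action linking both
    actions = [a for a, r in RELATIONS.items()
               if tool in r["tools"] and obj in r["objects"]]
    if actions:
        return {"reconsider": "action", "alternatives": actions}
    rel = RELATIONS.get(action)
    if rel:
        # (b) keep the action and whichever argument fits it
        if tool in rel["tools"]:
            return {"reconsider": "object", "alternatives": sorted(rel["objects"])}
        if obj in rel["objects"]:
            return {"reconsider": "tool", "alternatives": sorted(rel["tools"])}
        # (c) keep only the action; suggest both tools and objects for it
        return {"reconsider": "tool and object",
                "alternatives": {"tools": sorted(rel["tools"]),
                                 "objects": sorted(rel["objects"])}}
    return {"reconsider": "all", "alternatives": []}

print(suggest("knife", "pour", "tomato"))     # case (a): 'slice' is more likely
print(suggest("knife", "slice", "pitcher"))   # case (b): reconsider the object
print(suggest("pitcher", "slice", "knife"))   # case (c): reconsider both
```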

The reasoner can also deal with cases in which no values have been provided for one or even two arguments of the triplet. It generally follows the above strategy in these cases too, the only difference being that it receives null values rather than cognitively implausible ones.

The reasoner ranks the different alternatives it suggests according to a number of criteria that take advantage of the hierarchical structure of language (taxonomic, isA relations between concepts) and of the automatic identification of the ‘basic level’ of verbal categorisation (Rosch 1978) in such hierarchies within the PRAXICON. In particular, it ranks the alternatives as follows:

(a) if the wrong concept was an entity/movement: it lists its ‘sister concepts’ first (i.e. all concepts that share the same basic level concept parent) and their children (in ascending order of distance), then its basic level concept (BL) parent, then the ‘sister concepts’ of its basic level parent and their children (in ascending order of distance), and then all the rest;

(b) if the wrong concept was a BL one: it lists its basic level ‘sister concepts’ first and then all the rest.
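The ranking over the isA hierarchy can be sketched as below. The hierarchy, the basic-level markings and the coarse four-tier scoring are illustrative assumptions; the actual criteria in the PRAXICON are richer (e.g. distance-based ordering within each tier).

```python
# Toy isA hierarchy (child -> parent); 'knife' and 'spoon' are assumed to be
# basic-level (BL) concepts, as identified automatically in the PRAXICON.
PARENT = {"bread knife": "knife", "paring knife": "knife",
          "knife": "cutlery", "teaspoon": "spoon",
          "spoon": "cutlery", "cutlery": "artifact"}
BASIC_LEVEL = {"knife", "spoon"}

def bl_parent(concept):
    """Walk up the isA chain to the first basic-level ancestor (or the
    concept itself, if it is basic-level); None if there is no BL ancestor."""
    while concept is not None and concept not in BASIC_LEVEL:
        concept = PARENT.get(concept)
    return concept

def rank(wrong, candidates):
    """Order candidate substitutes for a wrong concept: sister concepts under
    the same BL parent first, then the BL parent itself, then concepts under
    other BL concepts, then all the rest (sorted() is stable, so ties keep
    their input order)."""
    bl = bl_parent(wrong)
    def tier(c):
        if c == bl:
            return 1                      # the BL parent itself
        if bl_parent(c) == bl:
            return 0                      # sister concept (or its descendant)
        if bl_parent(c) is not None:
            return 2                      # under some other BL concept
        return 3                          # everything else
    return sorted(candidates, key=tier)

print(rank("paring knife", ["teaspoon", "knife", "bread knife", "cutlery"]))
# -> ['bread knife', 'knife', 'teaspoon', 'cutlery']
```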

Last, the reasoner is able to handle a number of cases of ‘strange’ or imprecise object recognition results that are inherent to the perceptual limitations of visual and auditory tools. Consider, for example, the triplet ‘spoon – stir – mug’. The affected object (the mug) seems to be wrongly recognised, since one does not stir the mug with a spoon but rather the contents of the mug. However, in stirring the contents, e.g. coffee, one affects the container too; perceptually, it is very difficult for visual tools, for instance, to recognise liquids or other substances, and the mug is a much more easily identifiable affected object. In such cases, the reasoner can detect that the apparent implausibility of the triplet is due to a container-content relation, and does not raise a false alarm to the perceptual tools. In the verbalisation of the description of such a scene it explains that, e.g., ‘someone stirs coffee (or a liquid in general) with a spoon in a mug’. Part-whole cases are treated similarly, making sure that common sense knowledge on the role of a part of an object within an action can safely be generalised to the whole object (but not the other way around).
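The container-content reading can be sketched as an extra check before raising a false alarm. The container-content pairs and the set of stirrable objects below are toy assumptions, not actual PRAXICON content.

```python
# Hypothetical container-content knowledge and 'action-object' knowledge
# for the 'stir' action.
CONTENTS = {"mug": {"coffee", "tea"}, "bowl": {"soup"}}
STIRRABLE = {"coffee", "tea", "soup"}

def explain_stir(tool, obj):
    """If the recognised affected object is a container whose typical contents
    fit the action, verbalise the container-content reading instead of
    flagging the triplet as implausible; otherwise return None."""
    if obj in CONTENTS:
        contents = sorted(CONTENTS[obj] & STIRRABLE)
        if contents:
            return (f"someone stirs {contents[0]} (or a liquid in general) "
                    f"with a {tool} in a {obj}")
    return None

print(explain_stir("spoon", "mug"))
# -> someone stirs coffee (or a liquid in general) with a spoon in a mug
```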

The reasoner is written in Java and has been implemented as a web service that can be called by a client application with the above-mentioned input. The output comprises a verdict on the cognitive plausibility of each input triplet, alternative triplets in case of wrong or incomplete input, and a verbalisation of the information in the triplet if the input is plausible. This output can be used by the perceptual tools to continue the analysis of the visual scene in an interactive mode. The reasoner is also able to compare information provided for the same action by different perceptual tools, i.e. in different triplets, and to come up with a description of the scene or suggestions of alternatives that take all sources of information into consideration. Here is an example:

Input to the reasoner web service

visual: pitcher slice cucumber
auditory: null chop cucumber

Output from the reasoner

visual: wrong
alternative triplets: knife slice cucumber
verbalisation: none

auditory: wrong
alternative triplets: knife chop cucumber
verbalisation: none

Final Verbalisation:

“In this scene, the visual modules recognise that someone slices a cucumber with a pitcher. The auditory modules recognise that someone chops a cucumber with something. Both the visual and the auditory modules provided me with cognitively implausible or incomplete descriptions of the scene. From their input I can suggest that someone slices or chops a cucumber with a knife.”
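The cross-modal combination behind the final verbalisation can be sketched as below. The merging policy shown (keep every argument that some modality got right, fill the rest from the knowledge base) is an illustrative assumption over a toy relation store, not the actual algorithm.

```python
# Toy stand-in for the PRAXICON relations used by both modalities.
RELATIONS = {
    "slice": {"tools": {"knife"}, "objects": {"cucumber", "tomato"}},
    "chop":  {"tools": {"knife"}, "objects": {"cucumber", "onion"}},
}

def merge(visual, auditory):
    """Combine two (tool, action, obj) triplets describing the same scene,
    assuming at least one modality reports a known action: for each plausible
    action, keep the arguments some modality got right and fall back to a
    knowledge-base suggestion for the rest."""
    actions = [a for a in (visual[1], auditory[1]) if a in RELATIONS]
    descriptions = []
    for action in actions:
        rel = RELATIONS[action]
        tool = next((t for t in (visual[0], auditory[0]) if t in rel["tools"]),
                    sorted(rel["tools"])[0])      # fall back to a KB suggestion
        obj = next((o for o in (visual[2], auditory[2]) if o in rel["objects"]),
                   sorted(rel["objects"])[0])
        descriptions.append(f"{action}s a {obj} with a {tool}")
    return "someone " + " or ".join(descriptions)

print(merge(("pitcher", "slice", "cucumber"), (None, "chop", "cucumber")))
# -> someone slices a cucumber with a knife or chops a cucumber with a knife
```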


Fellbaum C. (ed.) (1998) ‘WordNet: An Electronic Lexical Database’, The MIT Press, Cambridge, MA.
Pastra K., Balta E., Dimitrakis P., Karakatsiotis G. (2011) ‘Embodied Language Processing: a new generation of language technology’, in Proceedings of the AAAI 2011 International Workshop on ‘Language-Action Tools for Cognitive Artificial Agents: Integrating Vision, Action and Language’, San Francisco, USA.

Rosch E. (1978) ‘Principles of Categorization’, in E. Rosch and B. Lloyd (eds.) ‘Cognition and Categorization’, chapter 2, pp. 27-48, Lawrence Erlbaum Associates.