Manipulation Actions

Understanding human actions and activities is a very challenging task, and it is not a task of vision alone. Humans can understand what others are doing because they have models of actions and activities. They understand the goals of actions, and this allows them to interpret their observations despite large variations in how actions are executed and in visual conditions. Our group takes an interdisciplinary approach that combines Vision with AI and Language tools and execution on robots.

We describe complex actions as events at multiple time scales. At the lowest level, signals are chunked into primitive symbolic events, and these are then combined into increasingly complex events spanning longer and longer times. That is, similar to language, actions have a hierarchical, recursive structure. Complex events can be decomposed into sequences of simple, primitive events that are innate and have sensorimotor representations.

Based on these ideas, we have developed formalisms and demonstrated systems for action interpretation in video. We parse the video with visual recognition methods to create abstract symbolic representations of the ongoing action. These same representations are the high-level descriptions for the robot's action. Thus they are the basis for generalization.

Our goal in Robotics is to advance collaborative robotics and create visually learning robots. Using the video from its cameras, a robot observes humans performing actions on objects with their hands, builds appropriate representations, and functionally replicates these actions. Under the hood lie many open problems: what these representations are, and how and what we can transfer from a human to a robot.

Vision for Video, Language, and Cognition

The above video shows the parsing of a video via the action grammar and the vision processes under the hood.

Learning for action-based scene understanding

Cornelia Fermüller and Michael Maynord
Advanced Methods and Learning in Computer Vision, January, 2022  

Chapter Abstract
In this chapter we outline an action-centric framework which spans multiple time scales and levels of abstraction, producing both action and scene interpretations constrained towards action consistency. At the lower level of the visual hierarchy we detail affordances – object characteristics which afford themselves to different actions. At mid-levels we model individual actions, and at higher levels we model activities through leveraging knowledge and longer term temporal relations.

Forecasting Action through Contact Representations from First Person Video

Eadom Dessalene, Chinmaya Devaraj, Michael Maynord, Cornelia Fermüller, and Yiannis Aloimonos
IEEE Transactions on Pattern Analysis and Machine Intelligence, January, 2021. 

Paper Abstract
Human actions involving hand manipulations are structured according to the making and breaking of hand-object contact, and human visual understanding of action is reliant on anticipation of contact as is demonstrated by pioneering work in cognitive science. Taking inspiration from this, we introduce representations and models centered on contact, which we then use in action prediction and anticipation. We annotate a subset of the EPIC Kitchens dataset to include time-to-contact between hands and objects, as well as segmentations of hands and objects. Using these annotations we train the Anticipation Module, a module producing Contact Anticipation Maps and Next Active Object Segmentations - novel low-level representations providing temporal and spatial characteristics of anticipated near future action. On top of the Anticipation Module we apply Egocentric Object Manipulation Graphs (Ego-OMG), a framework for action anticipation and prediction. Ego-OMG models longer term temporal semantic relations through the use of a graph modeling transitions between contact delineated action states. Use of the Anticipation Module within Ego-OMG produces state-of-the-art results, achieving 1st and 2nd place on the unseen and seen test sets, respectively, of the EPIC Kitchens Action Anticipation Challenge, and achieving state-of-the-art results on the tasks of action anticipation and action prediction over EPIC Kitchens. We perform ablation studies over characteristics of the Anticipation Module to evaluate their utility.
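To make the idea of a graph over contact-delineated states concrete, here is a minimal sketch. It is not the Ego-OMG model itself: it only counts observed transitions and returns the most frequent successor. The state encoding (a pair naming the object held in each hand) and the example sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_transition_graph(sequences):
    """Count transitions between consecutive states in each sequence."""
    graph = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            graph[current][following] += 1
    return graph

def most_likely_next(graph, state):
    """Return the most frequently observed successor, or None if unseen."""
    if state not in graph or not graph[state]:
        return None
    return graph[state].most_common(1)[0][0]

# Hypothetical states: (object in left hand, object in right hand).
sequences = [
    [("none", "none"), ("none", "knife"), ("bread", "knife"), ("none", "knife")],
    [("none", "none"), ("none", "knife"), ("bread", "knife")],
    [("none", "none"), ("cup", "none")],
]
graph = build_transition_graph(sequences)
prediction = most_likely_next(graph, ("none", "none"))
```

A state changes only when a contact is made or broken, which is why the graph stays small compared with a frame-level representation.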

Computer Vision and Natural Language Processing: Recent approaches in Multimedia and Robotics

Peratham Wiriyathammabhum, Douglas Summers-Stay, Cornelia Fermüller, and Yiannis Aloimonos
ACM Computing Surveys (CSUR) 49 (4), 71, 2017. 

Paper Abstract
Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction of the integration of computer vision and natural language processing in multimedia and robotics applications with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies to integrate computer vision and natural language processing models as a unified theme of distributional semantics. We make an analog of distributional semantics in computer vision and natural language processing as image embedding and word embedding, respectively. We also present a unified view for the field and propose possible future directions.

Image Understanding Using Vision and Reasoning through Scene Description Graphs

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, and Cornelia Fermüller
Computer Vision and Image Understanding, Dec. 2017.

Paper Abstract Webpage
Two of the fundamental tasks in image understanding using text are caption generation and visual question answering. This work presents an intermediate knowledge structure that can be used for both tasks to obtain increased interpretability. We call this knowledge structure Scene Description Graph (SDG), as it is a directed labeled graph, representing objects, actions, regions, as well as their attributes, along with inferred concepts and semantic (from KM-Ontology), ontological (i.e. superclass, hasProperty), and spatial relations. Thereby a general architecture is proposed in which a system can represent both the content and underlying concepts of an image using an SDG. The architecture is implemented using generic visual recognition techniques and commonsense reasoning to extract graphs from images. The utility of the generated SDGs is demonstrated in the applications of image captioning, image retrieval, and through examples in visual question answering. The experiments in this work show that the extracted graphs capture syntactic and semantic content of images with reasonable accuracy.

DeepIU: An Architecture for Image Understanding

Somak Aditya, Chitta Baral, Yezhou Yang, Yiannis Aloimonos, and Cornelia Fermüller
Advances in Cognitive Systems 4, 2016. 

Paper Abstract
Image Understanding is fundamental to systems that need to extract contents and infer concepts from images. In this paper, we develop an architecture for understanding images, through which a system can recognize the content and the underlying concepts of an image, and reason and answer questions about both, using a visual module, a reasoning module, and a commonsense knowledge base. In this architecture, visual data combines with background knowledge and iterates through visual and reasoning modules to answer questions about an image or to generate a textual description of an image. We first provide motivations for such a Deep Image Understanding architecture, and then we describe the necessary components it should include. We also introduce our own preliminary implementation of this architecture and empirically show how this more generic implementation compares with a recent end-to-end Neural approach on specific applications. We address the knowledge-representation challenge in such an architecture by representing an image using a directed labeled graph (called Scene Description Graph). Our implementation uses generic visual recognition techniques and commonsense reasoning to extract such graphs from images. Our experiments show that the extracted graphs capture the syntactic and semantic content of an image with reasonable accuracy.

The cognitive dialogue: A new model for vision implementing common sense reasoning

Yiannis Aloimonos and Cornelia Fermüller
Image and Vision Computing, 34, 42-44, 2015.

Paper Abstract
We propose a new model for vision, where vision is part of an intelligent system that reasons. To achieve this we need to integrate perceptual processing with computational reasoning and linguistics. In this paper we present the basics of this formalism.

Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos From the World Wide Web

Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos.
The Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. 

Paper Abstract
In order to advance action generation and creation in robots beyond simple learned schemas we need computational tools that allow us to automatically interpret and represent human actions. This paper presents a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. Its goal is to robustly generate the sequence of atomic actions of seen longer actions in video in order to acquire knowledge for robots. The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and the other for object recognition. The higher level is a probabilistic manipulation action grammar based parsing module that aims at generating visual sentences for robot manipulation. Experiments conducted on a publicly available unconstrained video dataset show that the system is able to learn manipulation actions by “watching” unconstrained videos with high accuracy.

Learning the Semantics of Manipulation Action

Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos, and Eren Erdal Aksoy.
The 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015.

Paper Abstract
In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the λ-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.

Visual common-sense for scene understanding using perception, semantic parsing and reasoning

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermüller, and Yiannis Aloimonos
The Twelfth International Symposium on Logical Formalization on Commonsense Reasoning, 2015.

Paper Abstract
In this paper we explore the use of visual commonsense knowledge and other kinds of knowledge (such as domain knowledge, background knowledge, linguistic knowledge) for scene understanding. In particular, we combine visual processing with techniques from natural language understanding (especially semantic parsing), common-sense reasoning and knowledge representation and reasoning to improve visual perception to reason about finer aspects of activities.

A Cognitive System for Understanding Human Manipulation Actions.

Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos, and Anupam Guha
Advances in Cognitive Systems 3, 67–86, 2014.  

Paper Abstract
This paper describes the architecture of a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and that includes interacting modules for perception and reasoning. Our work contributes to two core problems at the heart of action understanding: (a) the grounding of relevant information about actions in perception (the perception-action integration problem), and (b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Grammar, a context-free grammar that organizes actions as a sequence of sub events. Each sub event is described by the hand, movements, objects and tools involved, and the relevant information about these factors is obtained from biologically inspired perception modules. These modules track the hands and objects, and they recognize the hand grasp, objects and actions using attention, segmentation, and feature description. Experiments on a new data set of manipulation actions show that our system extracts the relevant visual information and semantic representation. This representation could further be used by the cognitive agent for reasoning, prediction, and planning.

A Corpus-Guided Framework for Robotic Visual Perception.

Yezhou Yang, Ching L. Teo, Hal Daumé III, Cornelia Fermüller, and Yiannis Aloimonos
AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents, 2011 

Paper Abstract
We present a framework that produces sentence-level summarizations of videos containing complex human activities that can be implemented as part of the Robot Perception Control Unit (RPCU). This is done via: 1) detection of pertinent objects in the scene: tools and direct-objects, 2) predicting actions guided by a large lexical corpus and 3) generating the most likely sentence description of the video given the detections. We pursue an active object detection approach by focusing on regions of high optical flow. Next, an iterative EM strategy, guided by language, is used to predict the possible actions. Finally, we model the sentence generation process as a HMM optimization problem, combining visual detections and a trained language model to produce a readable description of the video. Experimental results validate our approach and we discuss the implications of our approach to the RPCU in future applications.

Active scene recognition with vision and language

Xiaodong Yu, Cornelia Fermüller, Ching-Lik Teo, Yezhou Yang, and Yiannis Aloimonos
IEEE Int. Conference on Computer Vision (ICCV), 810-817, 2011.  

Paper Abstract
This paper presents a novel approach to utilizing high level knowledge for the problem of scene recognition in an active vision framework, which we call active scene recognition. In traditional approaches, high level knowledge is used in the post-processing to combine the outputs of the object detectors to achieve better classification performance. In contrast, the proposed approach employs high level knowledge actively by implementing an interaction between a reasoning module and a sensory module (Figure 1). Following this paradigm, we implemented an active scene recognizer and evaluated it with a dataset of 20 scenes and 100+ objects. We also extended it to the analysis of dynamic scenes for activity recognition with attributes. Experiments demonstrate the effectiveness of the active paradigm in introducing attention and additional constraints into the sensing process.

Language Models for Semantic Extraction and Filtering in Video Action Recognition

Evelyne Tzoukermann, Jan Neumann, Jana Kosecka, Cornelia Fermüller, Ian Perera, Frank Ferraro, Ben Sapp, Rizwan Chaudhry, Gautam Singh.
The Twenty-Ninth AAAI Conference on Artificial Intelligence, 2011.  

Paper Abstract
The paper addresses the following issues: (a) how to represent semantic information from natural language so that a vision model can utilize it? (b) how to extract the salient textual information relevant to vision? For a given domain, we present a new model of semantic extraction that takes into account word relatedness as well as word disambiguation in order to apply to a vision model. We automatically process the text transcripts and perform syntactic analysis to extract dependency relations. We then perform semantic extraction on the output to filter semantic entities related to actions. The resulting data are used to populate a matrix of co-occurrences utilized by the vision processing modules. Results show that explicitly modeling the co-occurrence of actions and tools significantly improved performance.
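As an illustration of the co-occurrence matrix mentioned in this abstract, the following sketch counts (action, tool) pairs and normalizes one row into a conditional distribution. The input pairs are hypothetical stand-ins for the filtered semantic entities; the extraction pipeline itself (syntactic analysis, dependency relations) is not reproduced here.

```python
from collections import defaultdict

def build_cooccurrence(pairs):
    """Count how often each (action, tool) pair appears."""
    counts = defaultdict(lambda: defaultdict(int))
    for action, tool in pairs:
        counts[action][tool] += 1
    return counts

def tool_given_action(counts, action):
    """Normalize one row of the matrix into an estimate of P(tool | action)."""
    row = counts[action]
    total = sum(row.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in row.items()}

# Hypothetical pairs, standing in for entities extracted from transcripts.
pairs = [("cut", "knife"), ("cut", "knife"), ("cut", "scissors"), ("stir", "spoon")]
counts = build_cooccurrence(pairs)
probs = tool_given_action(counts, "cut")
```

A vision module could use such a row to re-rank tool detections once an action hypothesis is available, or the reverse.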

Robotics, Vision, and Cognition

See our robot as she learns how to make the drink by observing the person.

The videos show some of our robot's capabilities.

Fast Task-Specific Target Detection via Graph Based Constraints Representation and Checking

Wentao Luan, Yezhou Yang, Cornelia Fermüller, and John Baras.
International Conference on Robotics and Automation (ICRA), 2017.

Paper Abstract
We present a framework for fast target detection in real-world robotics applications. Considering that an intelligent agent attends to a task-specific object target during execution, our goal is to detect the object efficiently. We propose the concept of early recognition, which influences the candidate proposal process to achieve fast and reliable detection performance. To check the target constraints efficiently, we put forward a novel policy which generates a sub-optimal checking order, and we prove that it has bounded time cost compared to the optimal checking sequence, which is not achievable in polynomial time. Experiments on two different scenarios: 1) rigid object and 2) non-rigid body part detection validate our pipeline. To show that our method is widely applicable, we further present a human-robot interaction system based on our non-rigid body part detection.

What Can I Do Around Here? Deep Functional Scene Understanding for Cognitive Robots

Chengxi Ye, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos.
International Conference on Robotics and Automation (ICRA), 2017 

Paper Abstract Project page
For robots that have the capability to interact with the physical environment through their end effectors, understanding the surrounding scenes is not merely a task of image classification or object recognition. To perform actual tasks, it is critical for the robot to have a functional understanding of the visual scene. Here, we address the problem of localizing and recognizing functional areas in an arbitrary indoor scene, formulated as a two-stage deep learning based detection pipeline. A new scene functionality testing-bed, which is compiled from two publicly available indoor scene datasets, is used for evaluation. Our method is evaluated quantitatively on the new dataset, demonstrating the ability to perform efficient recognition of functional areas from arbitrary indoor scenes. We also demonstrate that our detection model can be generalized onto novel indoor scenes by cross validating it with the images from two different datasets.

Co-active Learning to Adapt Humanoid Movement for Manipulation.

Ren Mao, John Baras, Yezhou Yang, and Cornelia Fermüller
IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2016
The International Journal of Robotics Research

Paper Abstract
In this paper we address the problem of interactively adapting robot movement under various environmental constraints. Motion primitives are generally adopted to generate target motion from demonstrations. However, their generalization capability is weak when facing novel environments. Additionally, traditional motion generation methods do not consider the versatile constraints from various users, tasks, and environments. In this work, we propose a co-active learning framework for learning to adapt the robot end-effector’s movement for manipulation tasks. It is designed to adapt the original imitation trajectories, which are learned from demonstrations, to novel situations with various constraints. The framework also considers the user’s feedback on the adapted trajectories, and it learns to adapt movement through human-in-the-loop interactions. The implemented system generalizes trained motion primitives to various situations with different constraints considering user preferences. Experiments on a humanoid platform validate the effectiveness of our approach.

Manipulation Action Tree Bank: A Knowledge Resource for Humanoids.

Yezhou Yang, Anupam Guha, Cornelia Fermüller, and Yiannis Aloimonos.
IEEE-RAS International Conference on Humanoid Robots, Humanoids. 2014 

Paper Abstract Tree Bank
Our premise is that actions of manipulation are represented at multiple levels of abstraction. At the high level a grammatical structure represents symbolic information (objects, actions, tools, body parts) and their interaction in a temporal sequence, and at lower levels the symbolic quantities are grounded in perception. In this paper we create symbolic high-level representations in the form of manipulation action tree banks, which are parsed from annotated action corpora. A context free grammar provides the grammatical description for the creation of the semantic trees. Experiments conducted on the tree banks show that they allow one to 1) generate so-called visual semantic graphs (VSGs), 2) compare the semantic distance between steps of activities and 3) discover the underlying semantic space of an activity. We believe that tree banks are an effective and practical way to organize semantic structures of manipulation actions for humanoid applications. They could be used as basis for 1) automatic manipulation action understanding and execution and 2) reasoning and prediction during both observation and execution. The knowledge resource follows the widely used Penn Tree Bank format.
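Because the resource follows the Penn Tree Bank bracketed format, a tree can be read with a few lines of code. The sketch below parses one bracketed string into nested (label, children) tuples. The example string and its node labels (AP, HP, H, A, O) are hypothetical and are not taken from the tree bank itself.

```python
def parse_bracketed(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree

def leaves(tree):
    """Collect the leaf words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

# Hypothetical tree: a hand grasps a knife, then cuts bread.
example = "(AP (HP (H hand) (AP (A grasp) (O knife))) (AP (A cut) (O bread)))"
tree = parse_bracketed(example)
words = leaves(tree)
```

Once trees are in this nested form, comparing two activity steps reduces to comparing subtrees or their leaf sequences.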

Learning Hand Movements from Markerless Demonstrations for Humanoid Tasks.

Ren Mao, Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos, and John S. Baras
IEEE/RAS International Conference on Humanoid Robots, 2014. 

Paper Abstract
We present a framework for generating trajectories of the hand movement during manipulation actions from demonstrations so the robot can perform similar actions in new situations. Our contribution is threefold: 1) we extract and transform hand movement trajectories using a state-of-the-art markerless full hand model tracker from Kinect sensor data; 2) we develop a new bio-inspired trajectory segmentation method that automatically segments complex movements into action units, and 3) we develop a generative method to learn task specific control using Dynamic Movement Primitives (DMPs). Experiments conducted both on synthetic data and real data using the Baxter research robot platform validate our approach.
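For readers unfamiliar with Dynamic Movement Primitives, the following is a minimal one-dimensional rollout in a commonly used textbook formulation (a critically damped spring toward the goal plus a phase-dependent forcing term). It is a sketch under stated assumptions, not the method of the paper: the gains, the basis-function layout, and the zero weights are illustrative, and in practice the weights would be fitted to a demonstrated trajectory.

```python
import math

def rollout_dmp(y0, goal, weights, centers, widths, tau=1.0, dt=0.001,
                duration=1.5, alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Integrate a 1-D DMP with explicit Euler steps and return the positions."""
    y, z, x = y0, 0.0, 1.0  # position, scaled velocity, phase variable
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        # Gaussian basis functions evaluated at the current phase.
        psi = [math.exp(-h * (x - c) ** 2) for c, h in zip(centers, widths)]
        psi_sum = sum(psi)
        forcing = 0.0
        if psi_sum > 1e-12:
            weighted = sum(p * w for p, w in zip(psi, weights))
            forcing = weighted / psi_sum * x * (goal - y0)
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory

# Illustrative basis layout; zero weights give a plain point-to-point reach.
centers = [1.0, 0.75, 0.5, 0.25, 0.0]
widths = [20.0] * len(centers)
weights = [0.0] * len(centers)
trajectory = rollout_dmp(y0=0.0, goal=1.0, weights=weights,
                         centers=centers, widths=widths)
```

Changing `goal` reuses the same weights for a new target, which is the property that lets a learned movement be applied in new situations.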

Robots with Language: Multi-Label Visual Recognition Using NLP

Yezhou Yang, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos.
IEEE International Conference on Robotics and Automation, ICRA. 2013. 

Paper Abstract
There has been a recent interest in utilizing contextual knowledge to improve multi-label visual recognition for intelligent agents like robots. Natural Language Processing (NLP) can give us labels, the correlation of labels, and the ontological knowledge about them, so we can automate the acquisition of contextual knowledge. In this paper we show how to use tools from NLP in conjunction with Vision to improve visual recognition. There are two major approaches: First, different language databases organize words according to various semantic concepts. Using these, we can build special purpose databases that can predict the labels involved given a certain context. Here we build a knowledge base for the purpose of describing common daily activities. Second, statistical language tools can provide the correlations of different labels. We show a way to learn a language model from large corpus data that exploits these correlations and propose a general optimization scheme to integrate the language model into the system. Experiments conducted on three multi-label everyday recognition tasks support the effectiveness and efficiency of our approach, with significant gains in recognition accuracies when correlation information is used.

Minimalist Plans for Interpreting Manipulation Actions

A Guha, Y Yang, C Fermüller, Y Aloimonos
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013 

Paper Abstract
Humans attribute meaning to actions, and can recognize, imitate, predict, compose from parts, and analyse complex actions performed by other humans. We have built a model of action representation and understanding which takes as input perceptual data of humans performing manipulatory actions and finds a semantic interpretation of it. It achieves this by representing actions as minimal plans based on a few primitives. The motivation for our approach is to have a description that abstracts away the variations in the way humans perform actions. The model can be used to represent complex activities on the basis of simple actions. The primitives of these minimal plans are embodied in the physicality of the system doing the analysis. The model understands an action under observation by recognising which plan is occurring. Using primitives thus rooted in its own physical structure, the model has a semantic and causal understanding of what it observes. Using plans, the model considers actions as well as complex activities in terms of causality, compositions, and goal achievement, enabling it to perform complex tasks like prediction of primitives, separation of interleaved actions and filtering of perceptual input. We use our model over an action dataset involving humans using hand tools on objects in a constrained universe to understand an activity it has not seen before in terms of actions whose plans it knows of. The model thus illustrates a novel approach of understanding human actions by a robot.

Using a Minimal Action Grammar for Activity Understanding in the Real World

Douglas Summers-Stay, Ching L. Teo, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. 

Paper Abstract Dataset
There is good reason to believe that humans use some kind of recursive grammatical structure when they recognize and perform complex manipulation activities. We have built a system to automatically build a tree structure from observations of an actor performing such activities. The activity trees that result form a framework for search and understanding, tying action to language. We explore and evaluate the system by performing experiments over a novel complex activity dataset taken using synchronized Kinect and SR4000 Time of Flight cameras. Processing of the combined 3D and 2D image data provides the necessary terminals and events to build the tree from the bottom-up. Experimental results highlight the contribution of the action grammar in: 1) providing a robust structure for complex activity recognition over real data and 2) disambiguating interleaved activities from within the same sequence.

Towards a Watson That Sees: Language-Guided Action Recognition for Robots

Ching L. Teo, Yezhou Yang, Hal Daumé III, Cornelia Fermüller, and Yiannis Aloimonos
IEEE International Conference on Robotics and Automation, ICRA. 2012 

Paper Abstract Dataset
For robots of the future to interact seamlessly with humans, they must be able to reason about their surroundings and take actions that are appropriate to the situation. Such reasoning is only possible when the robot has knowledge of how the World functions, which must either be learned or hardcoded. In this paper, we propose an approach that exploits language as an important resource of high-level knowledge that a robot can use, akin to IBM’s Watson in Jeopardy!. In particular, we show how language can be leveraged to reduce the ambiguity that arises from recognizing actions involving hand-tools from video data. Starting from the premise that tools and actions are intrinsically linked, with one explaining the existence of the other, we trained a language model over a large corpus of English newswire text so that we can extract this relationship directly. This model is then used as a prior to select the best tool and action that explains the video. We formalize the approach in the context of 1) an unsupervised recognition and 2) a supervised classification scenario by an EM formulation for the former and integrating language features for the latter. Results are validated over a new hand-tool action dataset, and comparisons with state of the art STIP features showed significantly improved results when language is used. In addition, we discuss the implications of these results and how it provides a framework for integrating language into vision on other robotic applications.
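The role of the language model as a prior can be sketched as follows: visual evidence scores each tool, the language prior scores each action given a tool, and the pair with the highest product is selected. This is a simplified illustration, not the EM or supervised formulation of the paper, and all probabilities below are made-up values.

```python
def best_explanation(visual_scores, language_prior):
    """Return the (tool, action) pair maximizing P(tool | video) * P(action | tool)."""
    best_pair, best_score = None, 0.0
    for tool, p_tool in visual_scores.items():
        for action, p_action in language_prior.get(tool, {}).items():
            score = p_tool * p_action
            if score > best_score:
                best_pair, best_score = (tool, action), score
    return best_pair, best_score

# Hypothetical detector outputs: the video alone is nearly ambiguous.
visual_scores = {"knife": 0.55, "spoon": 0.45}
# Hypothetical prior, as might be estimated from a text corpus.
language_prior = {
    "knife": {"cut": 0.9, "stir": 0.1},
    "spoon": {"stir": 0.8, "cut": 0.2},
}
pair, score = best_explanation(visual_scores, language_prior)
```

With these numbers the knife and cut pair wins (0.55 × 0.9) over spoon and stir (0.45 × 0.8), showing how a strong tool-action association can settle a weak visual decision.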

Vision Processes Essential for Understanding Manipulations

Topology-Aware Non-Rigid Point Cloud Registration

Konstantinos Zampogiannis, Cornelia Fermüller, Yiannis Aloimonos.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 

The work introduces a non-rigid point cloud registration pipeline, with special attention to topological scene changes.

Paper Abstract Project page
In this paper, we introduce a non-rigid registration pipeline for pairs of unorganized point clouds that may be topologically different. Standard warp field estimation algorithms, even under robust, discontinuity-preserving regularization, tend to produce erratic motion estimates on boundaries associated with ‘close-to-open’ topology changes. We overcome this limitation by exploiting backward motion: in the opposite motion direction, a ‘close-to-open’ event becomes ‘open-to-close’, which is by default handled correctly. At the core of our approach lies a general, topology-agnostic warp field estimation algorithm, similar to those employed in recently introduced dynamic reconstruction systems from RGB-D input. We improve motion estimation on boundaries associated with topology changes in an efficient post-processing phase. Based on both forward and (inverted) backward warp hypotheses, we explicitly detect regions of the deformed geometry that undergo topological changes by means of local deformation criteria and broadly classify them as ‘contacts’ or ‘separations’. Subsequently, the two motion hypotheses are seamlessly blended on a local basis, according to the type and proximity of detected events. Our method achieves state-of-the-art motion estimation accuracy on the MPI Sintel dataset. Experiments on a custom dataset with topological event annotations demonstrate the effectiveness of our pipeline in estimating motion on event boundaries, as well as promising performance in explicit topological event detection.

Prediction of Manipulation Actions

Cornelia Fermüller, Fang Wang, Yezhou Yang, Kostas Zampogiannis, Yi Zhang, Francisco Barranco, and Michael Pfeiffer
International Journal of Computer Vision, 126 (2-4), 358-374, 2018. 

We study fine-grained, similar manipulation actions using vision and force measurements. We ask at what point in time these actions can be predicted, and whether forces recorded during performance of the actions can help in recognition from vision only.

Paper Abstract Project page
By looking at a person’s hands, one can often tell what the person is going to do next, how his/her hands are moving and where they will be, because an actor’s intentions shape his/her movement kinematics during action execution. Similarly, active systems with real-time constraints must not simply rely on passive video-segment classification, but they have to continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping” and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skills in predicting actions from video sequences of different length, depicting the hand movement in the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction using as input image patches around the hand. We also used the same formalism to predict the forces on the finger tips using for training synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task, and demonstrate the ability of our algorithms to predict in real time what and how a dexterous action is performed.

Seeing Behind The Scene: Using Symmetry To Reason About Objects in Cluttered Environments

A. Ecins, C. Fermüller, and Y. Aloimonos.
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018.

Rotational and reflectional symmetries are detected by fitting symmetry axes/planes to smooth surfaces extracted from pointclouds, and they are then used for object segmentation.

Paper Abstract Project page
Symmetry is a common property shared by the majority of man-made objects. This paper presents a novel bottom-up approach for segmenting symmetric objects and recovering their symmetries from 3D pointclouds of natural scenes. Candidate rotational and reflectional symmetries are detected by fitting symmetry axes/planes to the geometry of the smooth surfaces extracted from the scene. Individual symmetries are used as constraints for the foreground segmentation problem that uses symmetry as a global grouping principle. Evaluation on a challenging dataset shows that our approach can reliably segment objects and extract their symmetries from incomplete 3D reconstructions of highly cluttered scenes, outperforming state-of-the-art methods by a wide margin.
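To make the notion of a candidate reflectional symmetry concrete, the following is a minimal sketch of how one might score a single candidate plane against a point set. It is not the paper's fitting or segmentation method: the function names `reflect` and `symmetry_error`, the nearest-neighbor error measure, and the toy box data are all assumptions chosen for illustration.

```python
import math

def reflect(p, normal, offset):
    """Reflect point p across the plane {x : normal . x = offset}.
    Assumes `normal` has unit length."""
    s = sum(pi * ni for pi, ni in zip(p, normal)) - offset
    return tuple(pi - 2.0 * s * ni for pi, ni in zip(p, normal))

def symmetry_error(points, normal, offset):
    """Mean distance from each reflected point to its nearest original point.
    A value near zero suggests the plane is a plausible symmetry plane."""
    total = 0.0
    for p in points:
        r = reflect(p, normal, offset)
        total += min(math.dist(r, q) for q in points)
    return total / len(points)

# Toy data (assumption): the eight corners of a box that is symmetric
# about the plane x = 0 but not about the plane x = 0.5.
box = [(x, y, z) for x in (-1.0, 1.0) for y in (0.0, 2.0) for z in (0.0, 1.0)]

err_good = symmetry_error(box, (1.0, 0.0, 0.0), 0.0)
err_bad = symmetry_error(box, (1.0, 0.0, 0.0), 0.5)
print(err_good, err_bad)
```

On real, incomplete reconstructions a score like this would need to account for occlusion and missing data, which is part of what the paper addresses.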

cilantro: a lean, versatile, and efficient library for point cloud data processing

Konstantinos Zampogiannis, Cornelia Fermüller, and Yiannis Aloimonos.
ACM Multimedia 2018 Open Source Software Competition, October 2018.

The library provides functionality that covers low-level point cloud operations, spatial reasoning, various methods for point cloud segmentation and generic data clustering, flexible algorithms for robust or local geometric alignment, model fitting, as well as powerful visualization tools.

Paper Abstract Project page Code
We introduce cilantro, an open-source C++ library for geometric and general-purpose point cloud data processing. The library provides functionality that covers low-level point cloud operations, spatial reasoning, various methods for point cloud segmentation and generic data clustering, flexible algorithms for robust or local geometric alignment, model fitting, as well as powerful visualization tools. To accommodate all kinds of workflows, cilantro is almost fully templated, and most of its generic algorithms operate in arbitrary data dimension. At the same time, the library is easy to use and highly expressive, promoting a clean and concise coding style. cilantro is highly optimized, has a minimal set of external dependencies, and supports rapid development of performant point cloud processing software in a wide variety of contexts.

Grasp Type Revisited: A Modern Perspective on A Classical Feature for Vision

Yezhou Yang, Cornelia Fermüller, Yi Li, and Yiannis Aloimonos.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

The usefulness of recognizing the grasp type is demonstrated in two tasks: interpreting action intention and segmenting fine-grained action videos.

Paper Abstract Page with dataset
The grasp type provides crucial information about human action. However, recognizing the grasp type from unconstrained scenes is challenging because of the large variations in appearance, occlusions and geometric distortions. In this paper, first we present a convolutional neural network to classify functional hand grasp types. Experiments on a public static scene hand data set validate good performance of the presented method. Then we present two applications utilizing grasp type classification: (a) inference of human action intention and (b) fine level manipulation action segmentation. Experiments on both tasks demonstrate the usefulness of grasp type as a cognitive feature for computer vision. This study shows that the grasp type is a powerful symbolic representation for action understanding, and thus opens new avenues for future research.

Learning the Spatial Semantics of Manipulation Actions through Preposition Grounding

Konstantinos Zampogiannis, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos
IEEE Int'l Conference on Robotics and Automation, 2015. 

We introduce an abstract representation for manipulation actions that is based on the evolution of the spatial relations between involved objects.

Paper Abstract Project page
In this project, we introduce an abstract representation for manipulation actions that is based on the evolution of the spatial relations between involved objects. Object tracking in RGBD streams enables straightforward and intuitive ways to model spatial relations in 3D space. Reasoning in 3D overcomes many of the limitations of similar previous approaches, while providing significant flexibility in the desired level of abstraction. At each frame of a manipulation video, we evaluate a number of spatial predicates for all object pairs and treat the resulting set of sequences (Predicate Vector Sequences, PVS) as an action descriptor. As part of our representation, we introduce a symmetric, time-normalized pairwise distance measure that relies on finding an optimal object correspondence between two actions. We experimentally evaluate the method on the classification of various manipulation actions in video, performed at different speeds and timings and involving different objects. The results demonstrate that the proposed representation is remarkably descriptive of the high-level manipulation semantics.
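As a rough illustration of the descriptor described above, the sketch below evaluates a small set of spatial predicates for every ordered object pair at each frame and collects the results into per-pair sequences. The predicates `above` and `near`, their thresholds, the use of object centroids, and the toy frames are assumptions made for this example; the paper's actual predicate set and its time-normalized distance measure are not reproduced here.

```python
from itertools import permutations

# Hypothetical predicates on 3D centroids (x, y, z), with z pointing up.
def above(a, b, margin=0.05):
    return a[2] - b[2] > margin

def near(a, b, radius=0.10):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5 < radius

PREDICATES = (above, near)

def predicate_vector_sequences(frames):
    """frames: list of dicts mapping object name -> (x, y, z) centroid.
    Returns {(obj_a, obj_b): [predicate vector per frame]}."""
    names = sorted(frames[0])
    pvs = {}
    for a, b in permutations(names, 2):
        pvs[(a, b)] = [
            tuple(int(pred(f[a], f[b])) for pred in PREDICATES) for f in frames
        ]
    return pvs

# Toy sequence (assumption): a cup is lowered onto a table.
frames = [
    {"cup": (0.0, 0.0, 0.30), "table": (0.0, 0.0, 0.0)},
    {"cup": (0.0, 0.0, 0.15), "table": (0.0, 0.0, 0.0)},
    {"cup": (0.0, 0.0, 0.04), "table": (0.0, 0.0, 0.0)},
]
pvs = predicate_vector_sequences(frames)
print(pvs[("cup", "table")])
```

In this toy sequence, the (cup, table) entry changes from "above, not near" to "not above, near" at the final frame, which is the kind of relational change a descriptor of this type is meant to capture.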

Affordance Detection of Tool Parts from Geometric Features

Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos.
IEEE International Conference on Robotics and Automation, ICRA. 2015.

An essential cue for object affordances is shape. We created a 3D database of household tools with the affordances of their parts annotated, and provided methods to learn patch-based affordances from shape. 

Paper Abstract Project page
As robots begin to collaborate with humans in everyday workspaces, they will need to understand the functions of tools and their parts. To cut an apple or hammer a nail, robots need to not just know the tool’s name, but they must localize the tool’s parts and identify their functions. Intuitively, the geometry of a part is closely related to its possible functions, or its affordances. Therefore, we propose two approaches for learning affordances from local shape and geometry primitives: 1) superpixel based hierarchical matching pursuit (S-HMP); and 2) structured random forests (SRF). Moreover, since a part can be used in many ways, we introduce a large RGB-Depth dataset where tool parts are labeled with multiple affordances and their relative rankings. With ranked affordances, we evaluate the proposed methods on 3 cluttered scenes and over 105 kitchen, workshop and garden tools, using ranked correlation and a weighted F-measure score [26]. Experimental results over sequences containing clutter, occlusions, and viewpoint changes show that the approaches return precise predictions that could be used by a robot. S-HMP achieves high accuracy but at a significant computational cost, while SRF provides slightly less accurate predictions but in real-time. Finally, we validate the effectiveness of our approaches on the Cornell Grasping Dataset for detecting graspable regions, and achieve state-of-the-art performance.

Detection of Manipulation Action Consequences (MAC).

Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

A high-level categorization of action consequences is introduced, along with a method for detection that relies on coupling tracking with segmentation to detect topological changes. 

Paper Abstract Dataset Code
The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no invariant characterization has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions is the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper, a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as a testbed for other studies on this topic. Several experiments on this dataset demonstrate that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests demonstrate the effectiveness and efficiency of the method.
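One of the consequences mentioned above, division, can be illustrated with a toy check on the topology of an object's segmentation mask over time. The sketch below is not the paper's active tracking and segmentation method or its VSG procedure. It assumes binary masks are already available and simply flags the first frame at which the number of connected regions increases. The names `count_components` and `detect_division` and the toy masks are hypothetical.

```python
def count_components(mask):
    """Count 4-connected foreground regions in a binary grid (list of lists of 0/1)."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                count += 1
                stack = [(r, c)]
                seen.add((r, c))
                while stack:  # iterative flood fill over one region
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count

def detect_division(masks):
    """Return the index of the first frame whose region count exceeds the
    previous frame's count, or None if no such frame exists."""
    counts = [count_components(m) for m in masks]
    for i in range(1, len(counts)):
        if counts[i] > counts[i - 1]:
            return i
    return None

# Toy masks (assumption): an object that is whole, then cut in two.
whole = [[1, 1, 1, 1],
         [1, 1, 1, 1]]
cut = [[1, 1, 0, 1],
       [1, 1, 0, 1]]
print(detect_division([whole, whole, cut]))
```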