Datasets and Code

Manipulation Actions

Prediction of Manipulation Actions

Cornelia Fermüller, Fang Wang, Yezhou Yang, Kostas Zampogiannis, Yi Zhang, Francisco Barranco, and Michael Pfeiffer
International Journal of Computer Vision, 126 (2-4), 358-374, 2018.

We studied fine-grained, similar manipulation actions using vision and force measurements, asking at what point in time an action can be predicted, and whether forces recorded during action performance can help when recognizing from vision alone.

Paper Abstract Project page
By looking at a person’s hands, one can often tell what the person is going to do next, how his/her hands are moving and where they will be, because an actor’s intentions shape his/her movement kinematics during action execution. Similarly, active systems with real-time constraints must not simply rely on passive video-segment classification, but they have to continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping” and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skills in predicting actions from video sequences of different length, depicting the hand movement in the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction using as input image patches around the hand. We also used the same formalism to predict the forces on the finger tips using for training synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task, and demonstrate the ability of our algorithms to predict in real time what and how a dexterous action is performed.
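
As a rough illustration of the recurrent prediction setup described in the abstract, the sketch below runs an LSTM over per-frame hand-patch features and reads out an action estimate after every frame. It is not the authors' implementation; the feature dimension, number of actions, and the upstream patch-feature extractor are placeholder assumptions.

```python
# Minimal sketch (not the authors' code): an LSTM that emits action scores
# after every frame, so the prediction can be read out at any point during
# the action. Feature extraction from hand patches (e.g., a CNN) is assumed
# to happen elsewhere.
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=128, num_actions=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_actions)

    def forward(self, patch_features):
        # patch_features: (batch, time, feat_dim), one vector per video frame
        hidden_states, _ = self.lstm(patch_features)
        # One class-score vector per frame: the estimate is updated online
        # as more of the action is observed.
        return self.classifier(hidden_states)

# Example: scores after frame t for a hypothetical 40-frame clip
model = ActionPredictor()
clip = torch.randn(1, 40, 512)               # placeholder hand-patch features
per_frame_scores = model(clip)               # shape (1, 40, 5)
print(per_frame_scores[0, 10].softmax(-1))   # prediction after 11 frames
```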

What Can I Do Around Here? Deep Functional Scene Understanding for Cognitive Robots

Chengxi Ye, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos.
International Conference on Robotics and Automation (ICRA), 2017 

We propose a hierarchical categorization for functionalities (or affordances) of object parts in indoor scenes, and provide a labeled dataset as well as CNN based classification code.

Paper Abstract Project page
For robots that have the capability to interact with the physical environment through their end effectors, understanding the surrounding scenes is not merely a task of image classification or object recognition. To perform actual tasks, it is critical for the robot to have a functional understanding of the visual scene. Here, we address the problem of localizing and recognizing functional areas from an arbitrary indoor scene, formulated as a two-stage deep learning based detection pipeline. A new scene functionality testbed, which is compiled from two publicly available indoor scene datasets, is used for evaluation. Our method is evaluated quantitatively on the new dataset, demonstrating the ability to perform efficient recognition of functional areas from arbitrary indoor scenes. We also demonstrate that our detection model can be generalized onto novel indoor scenes by cross-validating it with images from two different datasets.

Grasp Type Revisited: A Modern Perspective on A Classical Feature for Vision

Yezhou Yang, Cornelia Fermüller, Yi Li, and Yiannis Aloimonos
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

The usefulness of recognizing the grasp type is demonstrated on two tasks: interpreting action intention and segmenting fine-grained action videos.

Paper Abstract Page with dataset
The grasp type provides crucial information about human action. However, recognizing the grasp type from unconstrained scenes is challenging because of the large variations in appearance, occlusions and geometric distortions. In this paper, first we present a convolutional neural network to classify functional hand grasp types. Experiments on a public static scene hand data set validate good performance of the presented method. Then we present two applications utilizing grasp type classification: (a) inference of human action intention and (b) fine level manipulation action segmentation. Experiments on both tasks demonstrate the usefulness of grasp type as a cognitive feature for computer vision. This study shows that the grasp type is a powerful symbolic representation for action understanding, and thus opens new avenues for future research.
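
A minimal sketch of the grasp-type classification idea, under the assumption that a hand region has already been cropped from the image; the architecture, input size, and number of grasp classes are illustrative and not the network from the paper.

```python
# Minimal sketch (not the paper's network): a small CNN that maps a cropped
# hand patch to grasp-type scores; sizes and class count are illustrative only.
import torch
import torch.nn as nn

class GraspTypeCNN(nn.Module):
    def __init__(self, num_grasp_types=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_grasp_types)

    def forward(self, patch):                  # patch: (B, 3, H, W) hand crop
        return self.classifier(self.features(patch).flatten(1))

scores = GraspTypeCNN()(torch.randn(1, 3, 96, 96))
print(scores.shape)                            # (1, 6) grasp-type scores
```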

Learning the Spatial Semantics of Manipulation Actions through Preposition Grounding

Konstantinos Zampogiannis, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos
IEEE Int'l Conference on Robotics and Automation, 2015. 

We introduce an abstract representation for manipulation actions that is based on the evolution of the spatial relations between involved objects.

Paper Abstract Project page
In this project, we introduce an abstract representation for manipulation actions that is based on the evolution of the spatial relations between involved objects. Object tracking in RGBD streams enables straightforward and intuitive ways to model spatial relations in 3D space. Reasoning in 3D overcomes many of the limitations of similar previous approaches, while providing significant flexibility in the desired level of abstraction. At each frame of a manipulation video, we evaluate a number of spatial predicates for all object pairs and treat the resulting set of sequences (Predicate Vector Sequences, PVS) as an action descriptor. As part of our representation, we introduce a symmetric, time-normalized pairwise distance measure that relies on finding an optimal object correspondence between two actions. We experimentally evaluate the method on the classification of various manipulation actions in video, performed at different speeds and timings and involving different objects. The results demonstrate that the proposed representation is remarkably descriptive of the high-level manipulation semantics.
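
The following sketch illustrates the idea of a Predicate Vector Sequence in simplified form: a few hand-picked spatial predicates are evaluated on 3D object centroids at every frame and stacked over time. The predicates, thresholds, and use of centroids (rather than full object geometry) are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code) of a Predicate Vector Sequence:
# for every frame and every object pair, evaluate a few binary spatial
# predicates on 3D centroids and stack them over time.
import numpy as np

def spatial_predicates(a, b, contact_thresh=0.05):
    """a, b: 3D centroids (x, y, z) of two tracked objects in one frame."""
    return np.array([
        a[2] > b[2],                              # 'above'
        b[2] > a[2],                              # 'below'
        np.linalg.norm(a - b) < contact_thresh,   # 'in contact'
    ], dtype=float)

def predicate_vector_sequence(track_a, track_b):
    """track_*: (T, 3) arrays of per-frame centroids for one object pair."""
    return np.stack([spatial_predicates(a, b) for a, b in zip(track_a, track_b)])

# Two hypothetical 100-frame object tracks -> a (100, 3) descriptor slice
t = np.linspace(0, 1, 100)[:, None]
cup = np.hstack([t, t, 0.1 + 0.2 * t])
table = np.tile([0.5, 0.5, 0.0], (100, 1))
pvs = predicate_vector_sequence(cup, table)
print(pvs.shape)
```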

Drone Research

GapFlyt: Active Vision Based Minimalist Structure-less Gap Detection For Quadrotor Flight

Nitin J. Sanket, Chahat Deep Singh, Kanishka Ganguly, Cornelia Fermüller, and Yiannis Aloimonos
IEEE Robotics and Automation Letters, 2018. 

An active approach for drones to navigate through a gap.

Paper Abstract Project page
Although quadrotors, and aerial robots in general, are inherently active agents, their perceptual capabilities in literature so far have been mostly passive in nature. Researchers and practitioners today use traditional computer vision algorithms with the aim of building a representation of general applicability: a 3D reconstruction of the scene. Using this representation, planning tasks are constructed and accomplished to allow the quadrotor to demonstrate autonomous behavior. These methods are inefficient as they are not task driven and such methodologies are not utilized by flying insects and birds. Such agents have been solving the problem of navigation and complex control for ages without the need to build a 3D map and are highly task driven.
In this paper, we propose this framework of bio-inspired perceptual design for quadrotors. We use this philosophy to design a minimalist sensorimotor framework for a quadrotor to fly through unknown gaps without an explicit 3D reconstruction of the scene, using only a monocular camera and onboard sensing. We successfully evaluate and demonstrate the proposed approach in many real-world experiments with different settings and window shapes, achieving a success rate of 85% at 2.5 m/s even with a minimum tolerance of just 5 cm. To our knowledge, this is the first paper which addresses the problem of gap detection of an unknown shape and location with a monocular camera and onboard sensing.

Event-based Vision

Event-based Moving Object Detection and Tracking

Anton Mitrokhin, Cornelia Fermüller, Chethan M. Parameshwara, and Yiannis Aloimonos
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018. 

A global approach for image stabilization is introduced, which is then used in an iterative scheme with segmentation to detect moving objects.

Paper Abstract Project page
Event-based vision sensors, such as the Dynamic Vision Sensor (DVS), are ideally suited for real-time motion analysis. The unique properties encompassed in the readings of such sensors provide high temporal resolution, superior sensitivity to light and low latency. These properties provide the grounds to estimate motion extremely reliably in the most sophisticated scenarios, but they come at a price: modern event-based vision sensors have extremely low resolution and produce a lot of noise. Moreover, the asynchronous nature of the event stream calls for novel algorithms. This paper presents a new, efficient approach to object tracking with asynchronous cameras. We present a novel event stream representation which enables us to utilize information about the dynamic (temporal) component of the event stream, and not only the spatial component, at every moment of time. This is done by approximating the 3D geometry of the event stream with a parametric model; as a result, the algorithm is capable of producing the motion-compensated event stream (effectively approximating egomotion) in extremely low-light and noisy conditions, without using any form of external sensors, feature tracking, or explicit optical flow computation. We demonstrate our framework on the task of independent motion detection and tracking, where we use the temporal model inconsistencies to locate differently moving objects in challenging situations of very fast motion.
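
The snippet below sketches the general motion-compensation idea in a much-simplified form: events are warped by a candidate global image velocity, and the velocity that yields the sharpest event-count image is kept; events inconsistent with that model would then be candidates for independently moving objects. This is a generic constant-velocity, contrast-style formulation, not the parametric model used in the paper.

```python
# Simplified sketch of event-stream motion compensation (not the paper's
# exact model): warp events by a candidate global image velocity, build an
# event-count image, and prefer the velocity that makes the image sharpest.
import numpy as np

def warp_and_count(x, y, t, vx, vy, shape=(180, 240)):
    """x, y: pixel coords; t: timestamps (s); vx, vy: candidate velocity (px/s)."""
    xw = np.round(x - vx * (t - t[0])).astype(int)
    yw = np.round(y - vy * (t - t[0])).astype(int)
    ok = (xw >= 0) & (xw < shape[1]) & (yw >= 0) & (yw < shape[0])
    img = np.zeros(shape)
    np.add.at(img, (yw[ok], xw[ok]), 1.0)
    return img

def best_velocity(x, y, t, candidates):
    # Variance of the count image is a common sharpness proxy.
    return max(candidates, key=lambda v: warp_and_count(x, y, t, *v).var())

# Hypothetical events from a few vertical edges drifting right at ~50 px/s
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 0.1, 5000))
y = rng.integers(0, 180, 5000).astype(float)
edges = rng.choice([20.0, 60.0, 100.0, 140.0], size=5000)
x = edges + 50.0 * t + rng.normal(0, 0.3, 5000)
grid = [(vx, 0.0) for vx in range(0, 101, 10)]
print(best_velocity(x, y, t, grid))        # should pick a velocity near (50, 0)
```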

A dataset for visual navigation with neuromorphic methods

Francisco Barranco, Cornelia Fermüller, Yiannis Aloimonos, and Tobi Delbruck
Frontiers in Neuroscience, 10, 49, 2016. 

This was the first event-based dataset for navigation having 3D motion, depth, and image motion.

Paper Abstract Project page
Standardized benchmarks in Computer Vision have greatly contributed to the advance of approaches to many problems in the field. If we want to enhance the visibility of event-driven vision and increase its impact, we will need benchmarks that allow comparison among different neuromorphic methods as well as comparison to Computer Vision conventional approaches. We present datasets to evaluate the accuracy of frame-free and frame-based approaches for tasks of visual navigation. Similar to conventional Computer Vision datasets, we provide synthetic and real scenes, with the synthetic data created with graphics packages, and the real data recorded using a mobile robotic platform carrying a dynamic and active pixel vision sensor (DAVIS) and an RGB+Depth sensor. For both datasets the cameras move with a rigid motion in a static scene, and the data includes the images, events, optic flow, 3D camera motion, and the depth of the scene, along with calibration procedures. Finally, we also provide simulated event data generated synthetically from well-known frame-based optical flow datasets.

Contour Detection and Characterization for Asynchronous Event Sensors

Francisco Barranco, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos
IEEE International Conference on Computer Vision (ICCV), 2015. 

An approach for learning the mid-level cues of boundary detection and border-ownership assignment is introduced, and the impact of different features is analyzed.

Paper Abstract Project page
The bio-inspired, asynchronous event-based dynamic vision sensor records temporal changes in the luminance of the scene at high temporal resolution. Since events are only triggered at significant luminance changes, most events occur at the boundary of objects and their parts. The detection of these contours is an essential step for further interpretation of the scene. This paper presents an approach to learn the location of contours and their border ownership using Structured Random Forests on event-based features that encode motion, timing, texture, and spatial orientations. The classifier elegantly integrates information over time by utilizing previously computed classification results. Finally, the contour detection and boundary assignment are demonstrated in a layer-segmentation of the scene. Experimental results demonstrate good performance in boundary detection and segmentation.
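
As a much-simplified stand-in for this learning setup (a plain per-pixel random forest instead of a Structured Random Forest, and only a few toy features), the sketch below classifies pixels as contour or not from event-based cues such as the event count and the gradient of a time surface. The feature choices and the labels are placeholders.

```python
# Much-simplified stand-in (not the paper's method): a per-pixel random
# forest on simple event-based features (event count, time surface, and its
# gradient magnitude) as a rough analogue of learned contour detection.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_pixel_features(xs, ys, ts, shape=(180, 240)):
    counts = np.zeros(shape)
    latest = np.zeros(shape)
    np.add.at(counts, (ys, xs), 1.0)
    latest[ys, xs] = ts                        # approximate time surface
    gy, gx = np.gradient(latest)
    return np.stack([counts, latest, np.hypot(gx, gy)], axis=-1).reshape(-1, 3)

# Hypothetical training data: event coordinates plus per-pixel contour labels
rng = np.random.default_rng(0)
xs = rng.integers(0, 240, 20000)
ys = rng.integers(0, 180, 20000)
ts = np.sort(rng.uniform(0, 0.05, 20000))
X = per_pixel_features(xs, ys, ts)
labels = rng.integers(0, 2, X.shape[0])        # placeholder ground truth
clf = RandomForestClassifier(n_estimators=20).fit(X, labels)
contour_prob = clf.predict_proba(X)[:, 1].reshape(180, 240)
print(contour_prob.shape)
```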

2D Image Operations

Viewpoint invariant texture description

Yong Xu, Hui Ji, and Cornelia Fermüller
International Journal of Computer Vision, 83 (1), 85 - 100 (2009). 

Texture descriptors based on Fractal geometry are shown to be theoretically invariant to smooth transformations, and demonstrated in algorithms on a high-resolution texture database that we collected.

Paper Abstract Project page
Image texture provides a rich visual description of the surfaces in the scene. Many texture signatures based on various statistical descriptions and various local measurements have been developed. Existing signatures, in general, are not invariant to 3D geometric transformations, which is a serious limitation for many applications. In this paper we introduce a new texture signature, called the multifractal spectrum (MFS). The MFS is invariant under the bi-Lipschitz map, which includes view-point changes and non-rigid deformations of the texture surface, as well as local affine illumination changes. It provides an efficient framework combining global spatial invariance and local robust measurements. Intuitively, the MFS could be viewed as a “better histogram” with greater robustness to various environmental changes and the advantage of capturing some geometrical distribution information encoded in the texture. Experiments demonstrate that the MFS codes the essential structure of textures with very low dimension, and thus represents a useful tool for texture classification.
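
A rough numerical sketch of the multifractal-spectrum idea: estimate a box-counting fractal dimension for each intensity level set and concatenate the dimensions into a descriptor. The level-set binning and box sizes are illustrative choices, not the paper's exact definition.

```python
# Rough numerical sketch (not the original MFS code): a box-counting fractal
# dimension per intensity level set, concatenated into an MFS-like descriptor.
import numpy as np

def box_counting_dimension(mask, box_sizes=(1, 2, 4, 8, 16)):
    """mask: binary 2D array marking one level set of the image."""
    counts = []
    for s in box_sizes:
        h, w = mask.shape
        hc, wc = h - h % s, w - w % s
        blocks = mask[:hc, :wc].reshape(hc // s, s, wc // s, s)
        counts.append(max(np.any(blocks, axis=(1, 3)).sum(), 1))
    # Slope of log(count) vs. log(1/size) estimates the fractal dimension.
    coeffs = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return coeffs[0]

def multifractal_spectrum(image, n_levels=8):
    edges = np.linspace(image.min(), image.max() + 1e-6, n_levels + 1)
    return np.array([
        box_counting_dimension((image >= lo) & (image < hi))
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

texture = np.random.rand(128, 128)           # stand-in for a texture image
print(multifractal_spectrum(texture))        # 8-dimensional descriptor
```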

Scale-space texture description on SIFT-like textons

Yong Xu, Sibin Huang, Hui Ji, and Cornelia Fermüller
Computer Vision and Image Understanding, 116 (9), 999 - 1013 (2012). 

A fractal-based texture descriptor defined on complex features is demonstrated for static and dynamic texture classification.

Paper Abstract Project page
Visual texture is a powerful cue for the semantic description of scene structures that exhibit a high degree of similarity in their image intensity patterns. This paper describes a statistical approach to visual texture description that combines a highly discriminative local feature descriptor with a powerful global statistical descriptor. Based upon a SIFT-like feature descriptor densely estimated at multiple window sizes, a statistical descriptor, called the multi-fractal spectrum (MFS), extracts the power-law behavior of the local feature distributions over scale. Through this combination strong robustness to environmental changes including both geometric and photometric transformations is achieved. Furthermore, to increase the robustness to changes in scale, a multi-scale representation of the multi-fractal spectra under a wavelet tight frame system is derived. The proposed statistical approach is applicable to both static and dynamic textures. Experiments showed that the proposed approach outperforms existing static texture classification methods and is comparable to the top dynamic texture classification techniques.

The image torque operator: A new tool for mid-level vision

Morimichi Nishigaki, Cornelia Fermüller, and Daniel DeMenthon
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012

The Torque is an image processing operator that implements the Gestaltist principle of closure. This paper demonstrates the Torque for the applications of attention, boundary detection, and segmentation. 

Paper Abstract Code
Contours are a powerful cue for semantic image understanding. Objects and parts of objects in the image are delineated from their surrounding by closed contours which make up their boundary. In this paper we introduce a new bottom-up visual operator to capture the concept of closed contours, which we call the 'Torque' operator. Its computation is inspired by the mechanical definition of torque or moment of force, and applied to image edges. The torque operator takes as input edges and computes over regions of different size a measure of how well the edges are aligned to form a closed, convex contour. We explore fundamental properties of this measure and demonstrate that it can be made a useful tool for visual attention, segmentation, and boundary edge detection by verifying its benefits on these applications.
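
The sketch below captures a simplified reading of the torque idea: for a patch center, sum the 2D moments of the edge tangents about that center, so that edges forming a closed contour around the center produce a large score. The normalization and discretization are illustrative and may differ from the paper.

```python
# Simplified sketch of a torque-like closure score (illustrative; the exact
# definition and normalization in the paper may differ): edge pixels whose
# tangents circulate consistently around a patch center yield a large summed
# moment, which is what closed convex contours produce.
import numpy as np

def torque_score(edge_points, edge_tangents, center, patch_radius):
    """edge_points: (N, 2) xy positions; edge_tangents: (N, 2) unit tangents."""
    r = edge_points - center                        # lever arms from the center
    inside = np.linalg.norm(r, axis=1) <= patch_radius
    # 2D cross product r x t: positive when the edge circulates one way
    moments = (r[inside, 0] * edge_tangents[inside, 1]
               - r[inside, 1] * edge_tangents[inside, 0])
    return moments.sum() / (patch_radius ** 2)      # normalize by patch size

# Edge points on a circle (a closed contour) score high at its center,
# and zero far away from it.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
pts = np.stack([20 * np.cos(theta), 20 * np.sin(theta)], axis=1)
tangents = np.stack([-np.sin(theta), np.cos(theta)], axis=1)
print(torque_score(pts, tangents, center=np.array([0.0, 0.0]), patch_radius=25))
print(torque_score(pts, tangents, center=np.array([60.0, 0.0]), patch_radius=25))
```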

A Gestaltist approach to contour-based object recognition: Combining bottom-up and top-down cues

Ching L Teo, Cornelia Fermüller, Yiannis Aloimonos
Advances in Computational Intelligence, 309-321, Springer International Publishing, 2015; also in The International Journal of Robotics Research.

The Torque is used first in a bottom-up way to detect possible objects. Then task-driven, high-level processes modulate the Torque to recognize specific objects.

Paper Abstract Project page
This paper proposes a method for detecting generic classes of objects from their representative contours that can be used by a robot with vision to find objects in cluttered environments. The approach uses a mid-level image operator to group edges into contours which likely correspond to object boundaries. This mid-level operator is used in two ways, bottom-up on simple edges and top-down incorporating object shape information, thus acting as the intermediary between low-level and high-level information. First, the mid-level operator, called the image torque, is applied to simple edges to extract likely fixation locations of objects. Using the operator’s output, a novel contour-based descriptor is created that extends the shape context descriptor to include boundary ownership information and accounts for rotation. This descriptor is then used in a multi-scale matching approach to modulate the torque operator towards the target, so it indicates its location and size. Unlike other approaches that use edges directly to guide the independent edge grouping and matching processes for recognition, both of these steps are effectively combined using the proposed method. We evaluate the performance of our approach using four diverse datasets containing a variety of object categories in clutter, occlusion and viewpoint changes. Compared with current state-of-the-art approaches, our approach is able to detect the target with fewer false alarms in most object categories. The performance is further improved when we exploit depth information available from the Kinect RGB-Depth sensor by imposing depth consistency when applying the image torque.

Fast 2D border ownership assignment

Ching Teo, Cornelia Fermüller, Yiannis Aloimonos.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 

Local and global features, inspired by psychological studies, are used to learn contour detection and the assignment of which side of each contour is foreground and which is background.

Paper Abstract Project page
A method for efficient border ownership assignment in 2D images is proposed. Leveraging on recent advances using Structured Random Forests (SRF) for boundary detection, we impose a novel border ownership structure that detects both boundaries and border ownership at the same time. Key to this work are features that predict ownership cues from 2D images. To this end, we use several different local cues: shape, spectral properties of boundary patches, and semi-global grouping cues that are indicative of perceived depth. For shape, we use HoG-like descriptors that encode local curvature (convexity and concavity). For spectral properties, such as extremal edges, we first learn an orthonormal basis spanned by the top K eigenvectors via PCA over common types of contour tokens. For grouping, we introduce a novel mid-level descriptor that captures patterns near edges and indicates ownership information of the boundary. Experimental results over a subset of the Berkeley Segmentation Dataset (BSDS) and the NYU Depth V2 dataset show that our method’s performance exceeds current state-of-the-art multi-stage approaches that use more complex features.

Detection and segmentation of 2D curved reflection symmetric structures

Ching Teo, Cornelia Fermüller, Yiannis Aloimonos.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 

Curved reflectional symmetries of objects are detected and then used for segmenting the objects.

Paper Abstract Project page
Symmetry, as one of the key components of Gestalt theory, provides an important mid-level cue that serves as input to higher visual processes such as segmentation. In this work, we propose a complete approach that links the detection of curved reflection symmetries to produce symmetry-constrained segments of structures/regions in real images with clutter. For curved reflection symmetry detection, we leverage on patch-based symmetric features to train a Structured Random Forest classifier that detects multi-scaled curved symmetries in 2D images. Next, using these curved symmetries, we modulate a novel symmetry-constrained foreground-background segmentation by their symmetry scores so that we enforce global symmetrical consistency in the final segmentation. This is achieved by imposing a pairwise symmetry prior that encourages symmetric pixels to have the same labels over a MRF-based representation of the input image edges, and the final segmentation is obtained via graph-cuts. Experimental results over four publicly available datasets containing annotated symmetric structures: 1) SYMMAX-300, 2) BSD-Parts, 3) Weizmann Horse (both from) and 4) NY-roads demonstrate the approach's applicability to different environments with state-of-the-art performance.

Shadow-Free Segmentation in Still Images Using Local Density Measure

Aleksandrs Ecins, Cornelia Fermüller, and Yiannis Aloimonos.
International Conference on Computational Photography (ICCP), 2014. 

A figure-ground segmentation algorithm that is not affected by shadows is proposed, designed specifically for textured regions.

Paper Abstract Project page
Over the last decades several approaches were introduced to deal with cast shadows in background subtraction applications. However, very few algorithms exist that address the same problem for still images. In this paper we propose a figure-ground segmentation algorithm to segment objects in still images affected by shadows. Instead of modeling the shadow directly in the segmentation process, our approach works actively by first segmenting an object, then testing the resulting boundary for the presence of shadows, and re-segmenting with modified segmentation parameters. In order to get better shadow boundary detection results we introduce a novel image preprocessing technique based on the notion of the image density map. This map improves the illumination invariance of classical filter-bank based texture description methods. We demonstrate that this texture feature improves shadow detection results. The resulting segmentation algorithm achieves good results on a new figure-ground segmentation dataset with challenging illumination conditions.

3D Point Clouds

Affordance Detection of Tool Parts from Geometric Features

Austin Myers, Ching L. Teo, Cornelia Fermüller, Yiannis Aloimonos.
IEEE International Conference on Robotics and Automation (ICRA), 2015.

An essential cue for object affordances is shape. We created a 3D database of household tools with the affordances of their parts annotated, and provided methods to learn patch-based affordances from shape. 

Paper Abstract Project page
As robots begin to collaborate with humans in everyday workspaces, they will need to understand the functions of tools and their parts. To cut an apple or hammer a nail, robots need to not just know the tool’s name, but they must localize the tool’s parts and identify their functions. Intuitively, the geometry of a part is closely related to its possible functions, or its affordances. Therefore, we propose two approaches for learning affordances from local shape and geometry primitives: 1) superpixel based hierarchical matching pursuit (S-HMP); and 2) structured random forests (SRF). Moreover, since a part can be used in many ways, we introduce a large RGB-Depth dataset where tool parts are labeled with multiple affordances and their relative rankings. With ranked affordances, we evaluate the proposed methods on 3 cluttered scenes and over 105 kitchen, workshop and garden tools, using ranked correlation and a weighted F-measure score [26]. Experimental results over sequences containing clutter, occlusions, and viewpoint changes show that the approaches return precise predictions that could be used by a robot. S-HMP achieves high accuracy but at a significant computational cost, while SRF provides slightly less accurate predictions but in real-time. Finally, we validate the effectiveness of our approaches on the Cornell Grasping Dataset for detecting graspable regions, and achieve state-of-the-art performance.
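
As a small illustration of the kind of geometric features a patch-based affordance classifier could consume, the sketch below derives per-pixel surface normals and a crude curvature proxy from a depth map. This feature set is a simplified assumption, not the S-HMP or SRF features used in the paper.

```python
# Illustrative sketch (not the paper's feature set): per-pixel geometric cues
# from a depth map -- surface normals from depth gradients and a simple
# curvature proxy -- as input a patch-based affordance classifier might use.
import numpy as np

def depth_to_features(depth):
    """depth: (H, W) array of depth values (meters)."""
    dzdy, dzdx = np.gradient(depth)
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    # Curvature proxy: how quickly the normal direction changes locally.
    nx_dy, nx_dx = np.gradient(normals[..., 0])
    ny_dy, ny_dx = np.gradient(normals[..., 1])
    curvature = np.abs(nx_dx) + np.abs(ny_dy)
    return np.dstack([normals, curvature[..., None]])   # (H, W, 4) feature map

# A hypothetical scene: a tilted plane with a bump (e.g., a bowl-like part)
yy, xx = np.mgrid[0:120, 0:160]
depth = 1.0 + 0.002 * xx + 0.05 * np.exp(-((xx - 80) ** 2 + (yy - 60) ** 2) / 200.0)
features = depth_to_features(depth)
print(features.shape)
```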

Cluttered scene segmentation using the symmetry constraint

Aleksandrs Ecins, Cornelia Fermüller, Yiannis Aloimonos.
IEEE Int'l Conference on Robotics and Automation, 2015. 

Reflectional symmetries, detected from point cloud data using contours as the main cue, are used to segment 3D objects.

Paper Abstract Project page
Although modern object segmentation algorithms can deal with isolated objects in simple scenes, segmenting nonconvex objects in cluttered environments remains a challenging task. We introduce a novel approach for segmenting unknown objects in partial 3D pointclouds that utilizes the powerful concept of symmetry. First, 3D bilateral symmetries in the scene are detected efficiently by extracting and matching surface normal edge curves in the pointcloud. Symmetry hypotheses are then used to initialize a segmentation process that finds points of the scene that are consistent with each of the detected symmetries. We evaluate our approach on a dataset of 3D pointcloud scans of tabletop scenes. We demonstrate that the use of the symmetry constraint enables our approach to correctly segment objects in challenging configurations and to outperform current state-of-the-art approaches.

Seeing Behind The Scene: Using Symmetry To Reason About Objects in Cluttered Environments

Aleksandrs Ecins, Cornelia Fermüller, and Yiannis Aloimonos.
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018

Rotational and reflectional symmetries are detected by fitting symmetry axes/planes to smooth surfaces extracted from pointclouds, and they are then used for object segmentation.

Paper Abstract Project page
Symmetry is a common property shared by the majority of man-made objects. This paper presents a novel bottom-up approach for segmenting symmetric objects and recovering their symmetries from 3D pointclouds of natural scenes. Candidate rotational and reflectional symmetries are detected by fitting symmetry axes/planes to the geometry of the smooth surfaces extracted from the scene. Individual symmetries are used as constraints for the foreground segmentation problem that uses symmetry as a global grouping principle. Evaluation on a challenging dataset shows that our approach can reliably segment objects and extract their symmetries from incomplete 3D reconstructions of highly cluttered scenes, outperforming state-of-the-art methods by a wide margin.
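
A hedged sketch of the symmetry-as-grouping idea: a candidate reflection plane is scored by mirroring the point cloud across it and checking how many mirrored points land near original points; the supporting points could then seed a foreground segment. The plane hypothesis, threshold, and scoring are illustrative, not the authors' detection pipeline.

```python
# Hedged sketch (not the authors' pipeline): score a candidate reflection
# plane by mirroring the point cloud across it and measuring how close the
# mirrored points land to the original cloud.
import numpy as np
from scipy.spatial import cKDTree

def reflect(points, plane_point, plane_normal):
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (points - plane_point) @ n
    return points - 2.0 * d[:, None] * n

def symmetry_support(points, plane_point, plane_normal, inlier_dist=0.01):
    mirrored = reflect(points, plane_point, plane_normal)
    dists, _ = cKDTree(points).query(mirrored)
    return dists < inlier_dist          # boolean mask of supporting points

# A hypothetical cloud built to be symmetric about the x = 0 plane
rng = np.random.default_rng(1)
half = rng.uniform([0.0, 0.0, 0.0], [0.1, 0.2, 0.3], size=(1000, 3))
cloud = np.vstack([half, reflect(half, np.zeros(3), np.array([1.0, 0.0, 0.0]))])
support = symmetry_support(cloud, np.zeros(3), np.array([1.0, 0.0, 0.0]))
print(support.mean())                    # fraction of points explained by the plane
```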

cilantro: a lean, versatile, and efficient library for point cloud data processing

Konstantinos Zampogiannis, Cornelia Fermüller, Yiannis Aloimonos.
ACM Multimedia 2018 Open Source Software Competition, October 2018. 

An open-source C++ library covering low-level point cloud operations, spatial reasoning, point cloud segmentation and generic data clustering, robust and local geometric alignment, model fitting, and visualization tools.

Paper Abstract Code
We introduce cilantro, an open-source C++ library for geometric and general-purpose point cloud data processing. The library provides functionality that covers low-level point cloud operations, spatial reasoning, various methods for point cloud segmentation and generic data clustering, flexible algorithms for robust or local geometric alignment, model fitting, as well as powerful visualization tools. To accommodate all kinds of workflows, cilantro is almost fully templated, and most of its generic algorithms operate in arbitrary data dimension. At the same time, the library is easy to use and highly expressive, promoting a clean and concise coding style. cilantro is highly optimized, has a minimal set of external dependencies, and supports rapid development of performant point cloud processing software in a wide variety of contexts.

Deep Learning

LightNet: A Versatile, Standalone Matlab-based Environment for Deep Learning

Chengxi Ye, Chen Zhao, Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos.
Proceedings of the 2016 ACM on Multimedia Conference.

LightNet is a lightweight, versatile, efficient, purely Matlab-based deep learning framework.

Paper Abstract Code
LightNet is a lightweight, versatile, purely Matlab-based deep learning framework. The idea underlying its design is to provide an easy-to-understand, easy-to-use and efficient computational platform for deep learning research. The implemented framework supports major deep learning architectures such as Multilayer Perceptron Networks (MLP), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The framework also supports both CPU and GPU computation, and the switch between them is straightforward. Different applications in computer vision, natural language processing and robotics are demonstrated as experiments.

Several datasets that we have used in our work.