Event-based Vision

Event-based vision sensors, such as the DVS, inspired in their design by biological vision, record data in a very compact form at high temporal resolution, with low latency and high dynamic range; these properties make them ideally suited for real-time motion analysis. So far we have focused on the fundamental capabilities of visual navigation, namely estimation of image motion, 3D motion, and object segmentation, and we have studied how spatially global and local computations interact to solve these tasks.

SpikeMS: Deep spiking neural network for motion segmentation

Chethan M Parameshwara,   Simin Li,   Cornelia Fermüller,   Nitin J Sanket,   Matthew S Evanusa,   Yiannis Aloimonos
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021. 

Paper Abstract Project page Code
Spiking Neural Networks (SNNs) are the so-called third generation of neural networks, which attempt to more closely match the functioning of the biological brain. They inherently encode temporal data, allow for training with less energy, and can be extremely energy efficient when implemented on neuromorphic hardware. In addition, they are well suited for tasks involving event-based sensors, which match the event-based nature of the SNN. However, SNNs have not been as effectively applied to real-world, large-scale tasks as standard Artificial Neural Networks (ANNs) due to their algorithmic and training complexity. To exacerbate the situation further, the input representation is unconventional and requires careful analysis and deep understanding. In this paper, we propose SpikeMS, the first deep encoder-decoder SNN architecture for the real-world large-scale problem of motion segmentation using an event-based DVS camera as input. To accomplish this, we introduce a novel spatio-temporal loss formulation that includes both spike counts and classification labels, in conjunction with new techniques for SNN backpropagation. In addition, we show that SpikeMS is capable of incremental predictions, i.e., predictions from smaller amounts of test data than it was trained on. This is invaluable for providing outputs even with partial input data for low-latency applications and those requiring fast predictions. We evaluate SpikeMS on challenging synthetic and real-world sequences from the EV-IMO, EED, and MOD datasets, achieving results on par with a comparable ANN method while using potentially 50 times less power.
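For readers less familiar with such objectives, the following is a minimal sketch of how a spatio-temporal loss could combine spike counts with per-pixel classification labels, assuming PyTorch tensors. The function name, the weighting factor alpha, and the exact combination are illustrative assumptions, not the formulation used in SpikeMS.

import torch
import torch.nn.functional as F

def spatio_temporal_loss(pred_spikes, target_spikes, target_labels, alpha=0.5):
    """Illustrative loss mixing spike counts and per-pixel classification.

    pred_spikes:   (T, B, C, H, W) spike trains from the SNN decoder
    target_spikes: (T, B, C, H, W) target spike trains derived from ground-truth masks
    target_labels: (B, H, W) integer class labels (0 = background, 1 = moving object)
    """
    # Temporal term: match spike counts accumulated over the time window.
    pred_counts = pred_spikes.sum(dim=0)      # (B, C, H, W)
    target_counts = target_spikes.sum(dim=0)
    count_loss = F.mse_loss(pred_counts, target_counts)

    # Classification term: treat accumulated spikes as per-pixel class scores.
    ce_loss = F.cross_entropy(pred_counts, target_labels)

    return alpha * count_loss + (1.0 - alpha) * ce_loss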

EVPropNet: Detecting drones by finding propellers for mid-air landing and following

Nitin J Sanket,   Chahat Deep Singh,   Chethan M Parameshwara,   Cornelia Fermüller,   Guido de Croon,   Yiannis Aloimonos
Robotics: Science and Systems (RSS) 2021.

Paper Abstract Project page Code
The rapid rise in accessibility of unmanned aerial vehicles, or drones, poses a threat to general security and confidentiality. Most commercially available or custom-built drones are multi-rotors comprising multiple propellers. Since these propellers rotate at high speed, they are generally the fastest-moving parts of an image and cannot be directly "seen" by a classical camera without severe motion blur. We utilize a class of sensors particularly suited to such scenarios, event cameras, which have high temporal resolution, low latency, and high dynamic range. In this paper, we model the geometry of a propeller and use it to generate simulated events, which are used to train a deep neural network called EVPropNet to detect propellers from the data of an event camera. EVPropNet transfers directly to the real world without any fine-tuning or retraining. We present two applications of our network: (a) tracking and following an unmarked drone and (b) landing on a near-hover drone. We successfully evaluate and demonstrate the proposed approach in many real-world experiments with different propeller shapes and sizes. Our network can detect propellers at a rate of 85.1% even when 60% of the propeller is occluded and can run at up to 35 Hz on a 2 W power budget. To our knowledge, this is the first deep-learning-based solution for detecting propellers (to detect drones). Finally, our applications also show impressive success rates of 92% and 90% for the tracking and landing tasks, respectively.
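As a rough illustration of how a geometric propeller model can be turned into simulated events, the sketch below renders a rotating blade silhouette and emits an event wherever the change in log intensity between consecutive time steps exceeds a threshold. The blade geometry, contrast threshold, and sampling rate are illustrative assumptions and do not reproduce the model used to train EVPropNet.

import numpy as np

def propeller_mask(h, w, blades=2, radius=0.4, angle=0.0):
    """Binary silhouette of an idealized multi-blade propeller (illustrative geometry)."""
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs - w / 2) / w
    y = (ys - h / 2) / h
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    # Blades are angular sectors whose width shrinks toward the tip.
    sector = np.cos(blades * (theta - angle)) > 0.8 + 0.5 * r
    return (r < radius) & sector

def simulate_events(h=128, w=128, rpm=6000.0, dt=1e-4, steps=200, threshold=0.3):
    """Emit (x, y, t, polarity) events from log-intensity changes of the rotating mask."""
    omega = rpm / 60.0 * 2.0 * np.pi              # angular speed in rad/s
    prev = np.log1p(propeller_mask(h, w, angle=0.0).astype(float))
    events = []
    for k in range(1, steps + 1):
        t = k * dt
        cur = np.log1p(propeller_mask(h, w, angle=omega * t).astype(float))
        diff = cur - prev
        ys, xs = np.nonzero(np.abs(diff) > threshold)
        events.extend((x, y, t, 1 if diff[y, x] > 0 else -1) for y, x in zip(ys, xs))
        prev = cur
    return events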

0-mms: Zero-shot multi-motion segmentation with a monocular event camera

Chethan M Parameshwara,   Nitin J Sanket,   Chahat Deep Singh,   Cornelia Fermüller,   Yiannis Aloimonos
IEEE International Conference on Robotics and Automation (ICRA) 2021. 

Paper Abstract Project page Dataset
Segmentation of moving objects in dynamic scenes is a key process in scene understanding for navigation tasks. Classical cameras suffer from motion blur in such scenarios, rendering them ineffective. In contrast, event cameras, because of their high temporal resolution and lack of motion blur, are tailor-made for this problem. We present an approach for monocular multi-motion segmentation that combines bottom-up feature tracking and top-down motion compensation into a unified pipeline, the first of its kind to our knowledge. Using the events within a time interval, our method segments the scene into multiple motions by splitting and merging. We further speed up our method using the concepts of motion propagation and cluster keyslices. The approach was successfully evaluated on both challenging real-world and synthetic scenarios from the EV-IMO, EED, and MOD datasets and outperformed the state-of-the-art detection rate by 12%, achieving new state-of-the-art average detection rates of 81.06%, 94.2%, and 82.35% on the aforementioned datasets. To enable further research and systematic evaluation of multi-motion segmentation, we present and open-source a new dataset/benchmark called MOD++, which includes challenging sequences and extensive data stratification in terms of camera and object motion, velocity magnitudes, direction, and rotational speeds.
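Motion compensation of events is a standard ingredient of such pipelines. The sketch below shows a generic contrast-maximization flavor of it: events within a time window are warped by a candidate image velocity, and the velocity producing the sharpest event image is kept. The grid search, velocity range, and sensor resolution are illustrative assumptions and do not reproduce the split-and-merge pipeline of 0-MMS.

import numpy as np

def warp_events(events, vx, vy, t_ref=0.0):
    """Shift each event (x, y, t, p) to the reference time assuming constant image velocity."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    return x - vx * (t - t_ref), y - vy * (t - t_ref)

def contrast(events, vx, vy, shape=(260, 346)):
    """Variance of the motion-compensated event count image (higher means sharper)."""
    wx, wy = warp_events(events, vx, vy)
    H, W = shape
    img, _, _ = np.histogram2d(wy, wx, bins=(H, W), range=[[0, H], [0, W]])
    return img.var()

def fit_motion(events, v_range=np.linspace(-200, 200, 41)):
    """Grid search for the image velocity (pixels/s) that best compensates the events."""
    best = max((contrast(events, vx, vy), vx, vy) for vx in v_range for vy in v_range)
    return best[1], best[2]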

Learning Visual Motion Segmentation Using Event Surfaces

Anton Mitrokhin,   Zhiyuan Hua,   Cornelia Fermüller,   Yiannis Aloimonos
Conference on Computer Vision and Pattern Recognition (CVPR) 2020. 

Paper Abstract Video
Event-based cameras have been designed for scene motion perception: their high temporal resolution and spatial data sparsity convert the scene into a volume of boundary trajectories and make it possible to track and analyze the evolution of the scene in time. Analyzing this data is computationally expensive, and there is a substantial lack of theory on dense-in-time object motion to guide the development of new algorithms; hence, many works resort to the simple solution of discretizing the event stream and converting it to classical pixel maps, which allows conventional image processing methods to be applied.

In this work we present a Graph Convolutional neural network for the task of scene motion segmentation by a moving camera. We convert the event stream into a 3D graph in (x,y,t) space and keep per-event temporal information. The difficulty of the task stems from the fact that, unlike in metric space, the shape of an object in (x,y,t) space depends on its motion and is not the same across the dataset. We discuss properties of the event data with respect to this 3D recognition problem and show that our Graph Convolutional architecture is superior to PointNet++. We evaluate our method on the state-of-the-art event-based motion segmentation dataset EV-IMO and compare to a frame-based method proposed by its authors. Our ablation studies show that increasing the event slice width improves accuracy and examine how subsampling and edge configurations affect network performance.

Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception

Anton Mitrokhin,   Peter Sutor,   Cornelia Fermüller,   Yiannis Aloimonos
Science Robotics 4 (30) 2019. 

Paper Abstract Project page
The hallmark of modern robotics is the ability to directly fuse the platform's perception with its motor abilities, a concept often referred to as active perception. Nevertheless, we find that action and perception are often kept in separate spaces, a consequence of traditional vision being frame-based and only existing in the moment, while motion is a continuous entity. This bridge is crossed by the dynamic vision sensor (DVS), a neuromorphic camera that can see motion. We propose a method of encoding actions and perceptions together into a single space that is meaningful, semantically informed, and consistent, using hyperdimensional binary vectors (HBVs). We used the DVS for visual perception and showed that the visual component can be bound with the system velocity to enable dynamic world perception, which creates an opportunity for real-time navigation and obstacle avoidance. Actions performed by an agent are directly bound to the perceptions experienced to form its own memory. Furthermore, because HBVs can encode entire histories of actions and perceptions - from atomic to arbitrary sequences - as constant-sized vectors, autoassociative memory was combined with deep learning paradigms for control. We demonstrate these properties on a quadcopter drone ego-motion inference task and the MVSEC (multivehicle stereo event camera) dataset.
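The core hyperdimensional operations are simple to state. The sketch below shows the usual binary-HBV primitives, binding by element-wise XOR and bundling by majority vote, and how unbinding a bundled memory with one component recovers a noisy copy of its partner. The dimensionality and the example pairing are illustrative, not the encoding used in the paper.

import numpy as np

D = 8192                                    # hypervector dimensionality (illustrative)
rng = np.random.default_rng(0)

def random_hv():
    """Random dense binary hypervector."""
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):
    """Binding by element-wise XOR: associates two vectors; the result resembles neither."""
    return np.bitwise_xor(a, b)

def bundle(vectors):
    """Bundling by majority vote: the result stays similar to every input."""
    return (np.sum(vectors, axis=0) > len(vectors) / 2).astype(np.uint8)

def similarity(a, b):
    """Normalized Hamming similarity in [0, 1]; about 0.5 for unrelated vectors."""
    return 1.0 - np.count_nonzero(a != b) / D

# Hypothetical example: store several (percept, velocity) bindings in one memory vector.
percept, velocity = random_hv(), random_hv()
distractors = [bind(random_hv(), random_hv()) for _ in range(4)]
memory = bundle([bind(percept, velocity)] + distractors)

# Unbinding the memory with the velocity recovers a noisy copy of the percept.
print(similarity(bind(memory, velocity), percept))   # noticeably above 0.5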

EVDodge: Embodied AI for High-Speed Dodging on a quadrotor using event cameras

Nitin Sanket,   Chethan Parameshwara,  Chahat Deep Singh,   Cornelia Fermüller,   Davide Scaramuzza,   Yiannis Aloimonos
IEEE International Conference on Robotics and Automation (ICRA) 2020. 

Paper Abstract Project page
The human fascination with understanding ultra-efficient, agile flying beings like birds and bees has propelled decades of research on the problem of obstacle avoidance for micro aerial robots. However, most of the prior research has focused on static obstacle avoidance, owing to the lack of high-speed visual sensors and scalable visual algorithms. The last decade has seen exponential growth in neuromorphic sensors, which are inspired by nature and have the potential to become the de facto standard for visual motion estimation problems.

After re-imagining the navigation stack of a micro air vehicle as a series of hierarchical competences, we develop a purposive, artificial-intelligence-based formulation of the problem of general navigation. We call this AI framework "Embodied AI": AI design based on knowledge of the agent's hardware limitations and timing/computation constraints. Following this design philosophy, we develop a complete AI navigation stack for dodging multiple dynamic obstacles on a quadrotor with a monocular event camera and on-board computation. We also present an approach to directly transfer shallow neural networks trained in simulation to the real world by subsuming neural-network-based pre-processing into the pipeline.

We successfully evaluate and demonstrate the proposed approach in many real-world experiments with obstacles of different shapes and sizes, achieving an overall success rate of 70%, including objects of unknown shape and a low-light testing scenario. To our knowledge, this is the first deep-learning-based solution to the problem of dynamic obstacle avoidance using event cameras on a quadrotor. Finally, we also extend our work to a pursuit task by merely reversing the control policy, showing that our navigation stack can cater to different scenarios.

EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras

Anton Mitrokhin,   ChengXi Ye,   Cornelia Fermüller,   Yiannis Aloimonos,   Tobi Delbruck
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019. 

Paper Abstract Project page Dataset Code
We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset, EV-IMO, which includes accurate pixel-wise motion masks, egomotion, and ground-truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low-parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates a pixel-wise segmentation of independently moving objects and computes per-object 3D translational velocities. Additionally, we train a shallow network with just 40k parameters that is able to compute depth and egomotion.

Our EV-IMO dataset features 32 minutes of indoor recording with one to three fast-moving objects simultaneously in the camera frame. The objects and the camera are tracked by a VICON motion capture system. We use 3D scans of the room and objects to obtain accurate ground-truth depth maps and pixel-wise object masks, which are reliable even in poor lighting conditions and during fast motion. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that our approach far surpasses its rivals and is well suited for scene-constrained robotics applications.

Unsupervised Learning of Dense Optical Flow and Depth from Sparse Event Data

ChengXi Ye,   Anton Mitrokhin,   Cornelia Fermüller,  James A. Yorke and Yiannis Aloimonos
arXiv 2019. 

Paper Abstract Project page
In this work we present a lightweight, unsupervised learning pipeline for dense depth, optical flow, and egomotion estimation from the sparse event output of the Dynamic Vision Sensor (DVS). To tackle this low-level vision task, we use a novel encoder-decoder neural network architecture, ECN.

Our work is the first monocular pipeline that generates dense depth and optical flow from sparse event data only. The network works in self-supervised mode and has just 150k parameters. We evaluate our pipeline on the MVSEC self-driving dataset and present results for depth, optical flow, and egomotion estimation. Due to the lightweight design, the inference part of the network runs at 250 FPS on a single GPU, making the pipeline ready for real-time robotics applications. Our experiments demonstrate significant improvements over previous works that used deep learning on event data, as well as the ability of our pipeline to perform well during both day and night.
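A useful reference point for such self-supervised pipelines is the classical instantaneous-motion model that ties depth, egomotion, and optical flow together: given a depth map and the camera's translational and rotational velocities, the induced flow follows in closed form. The sketch below implements that textbook relation; it is not the network's loss, and the calibration values in the comment are placeholders.

import numpy as np

def rigid_flow(depth, t_vel, omega, fx, fy, cx, cy):
    """Optical flow (u, v), in pixels per second, induced by camera translation t_vel and
    rotation omega over a depth map (classical instantaneous-motion equations)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x = (xs - cx) / fx                  # normalized image coordinates
    y = (ys - cy) / fy
    Z = depth
    tx, ty, tz = t_vel
    wx, wy, wz = omega
    u = fx * ((-tx + x * tz) / Z + x * y * wx - (1 + x**2) * wy + y * wz)
    v = fy * ((-ty + y * tz) / Z + (1 + y**2) * wx - x * y * wy - x * wz)
    return u, v

# Hypothetical usage: flat scene at 3 m, camera translating forward at 1 m/s.
# depth = np.full((260, 346), 3.0)
# u, v = rigid_flow(depth, t_vel=(0, 0, 1.0), omega=(0, 0, 0),
#                   fx=200.0, fy=200.0, cx=173.0, cy=130.0)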

Event-based Moving Object Detection and Tracking

Anton Mitrokhin,  Cornelia Fermüller,  Chethan M Parameshwara and Yiannis Aloimonos
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018.

Paper Abstract Project page
Event-based vision sensors, such as the Dynamic Vision Sensor (DVS), are ideally suited for real-time motion analysis. The unique properties of such sensors' readings provide high temporal resolution, superior sensitivity to light, and low latency. These properties provide the grounds to estimate motion extremely reliably in the most sophisticated scenarios, but they come at a price: modern event-based vision sensors have extremely low resolution and produce a lot of noise. Moreover, the asynchronous nature of the event stream calls for novel algorithms. This paper presents a new, efficient approach to object tracking with asynchronous cameras. We present a novel event stream representation that enables us to utilize information about the dynamic (temporal) component of the event stream, and not only the spatial component, at every moment in time. This is done by approximating the 3D geometry of the event stream with a parametric model; as a result, the algorithm is capable of producing a motion-compensated event stream (effectively approximating egomotion) without using any external sensors, even in extremely low-light and noisy conditions, and without any form of feature tracking or explicit optical flow computation. We demonstrate our framework on the task of independent motion detection and tracking, where we use inconsistencies with the temporal model to locate differently moving objects in challenging situations of very fast motion.
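As a minimal illustration of a per-pixel representation that keeps the temporal component of the event stream, the sketch below accumulates event counts together with mean timestamps and timestamp spread. It is only meant to convey the idea; the representation and the parametric model fit used in the paper are different.

import numpy as np

def event_image(xs, ys, ts, shape):
    """Per-pixel event count, mean timestamp, and timestamp spread for one time slice."""
    count = np.zeros(shape)
    t_sum = np.zeros(shape)
    t_min = np.full(shape, np.inf)
    t_max = np.full(shape, -np.inf)
    np.add.at(count, (ys, xs), 1.0)
    np.add.at(t_sum, (ys, xs), ts)
    np.minimum.at(t_min, (ys, xs), ts)
    np.maximum.at(t_max, (ys, xs), ts)
    mean_t = np.divide(t_sum, count, out=np.zeros(shape), where=count > 0)
    spread = np.where(count > 0, t_max - t_min, 0.0)
    return np.stack([count, mean_t, spread])   # (3, H, W) spatial + temporal channels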

Real-time clustering and multi-target tracking using event-based sensors

Francisco Barranco,  Cornelia Fermüller,  and Eduardo Ros
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018.

Paper Abstract
Clustering is crucial for many computer vision applications such as robust tracking, object detection, and segmentation. This work presents a real-time clustering technique that takes advantage of the unique properties of event-based vision sensors. Since event-based sensors trigger events only when the intensity changes, the data is sparse, with low redundancy. Thus, our approach redefines the well-known mean-shift clustering method using asynchronous events instead of conventional frames. The potential of our approach is demonstrated in a multi-target tracking application using Kalman filters to smooth the trajectories. We evaluated our method on an existing dataset with patterns of different shapes and speeds, and on a new dataset that we collected, in which the sensor was attached to a Baxter robot in an eye-in-hand setup, monitoring real-world objects in an action manipulation task. Clustering achieved an F-measure of 0.95 while reducing the computational cost by 88% compared to the frame-based method. The average tracking error was 2.5 pixels, and the clustering yielded a consistent number of clusters over time.
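The sketch below conveys the flavor of event-driven mean shift: each incoming event is shifted toward the local mode of recent events and then attached to the nearest cluster center, or spawns a new one. The bandwidth, merge distance, and smoothing factor are illustrative choices, not the parameters of the published method.

import numpy as np

def mean_shift_event(event_xy, recent_xy, bandwidth=10.0, iters=5):
    """Shift a single event toward the mode of recent events (flat-kernel mean shift)."""
    m = np.asarray(event_xy, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(recent_xy - m, axis=1)
        nbrs = recent_xy[d < bandwidth]
        if len(nbrs) == 0:
            break
        m = nbrs.mean(axis=0)
    return m

def assign_cluster(mode, centers, merge_dist=15.0):
    """Attach the converged mode to an existing cluster center or start a new one.

    centers is a list of 2D cluster centers maintained by the caller across events.
    """
    if centers:
        d = np.linalg.norm(np.asarray(centers) - mode, axis=1)
        i = int(np.argmin(d))
        if d[i] < merge_dist:
            centers[i] = 0.9 * np.asarray(centers[i]) + 0.1 * mode   # smooth update
            return i
    centers.append(mode)
    return len(centers) - 1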

A dataset for visual navigation with neuromorphic methods

Francisco Barranco, Cornelia Fermüller, Yiannis Aloimonos, and Tobi Delbruck
Frontiers in Neuroscience, 10, 49, 2016.

Paper Abstract Project page
Standardized benchmarks in Computer Vision have greatly contributed to the advance of approaches to many problems in the field. If we want to enhance the visibility of event-driven vision and increase its impact, we will need benchmarks that allow comparison among different neuromorphic methods as well as comparison to conventional Computer Vision approaches. We present datasets to evaluate the accuracy of frame-free and frame-based approaches for tasks of visual navigation. Similar to conventional Computer Vision datasets, we provide synthetic and real scenes, with the synthetic data created with graphics packages and the real data recorded using a mobile robotic platform carrying a dynamic and active-pixel vision sensor (DAVIS) and an RGB+Depth sensor. For both datasets the cameras move with rigid motion in a static scene, and the data includes the images, events, optic flow, 3D camera motion, and the depth of the scene, along with calibration procedures. Finally, we also provide simulated event data generated synthetically from well-known frame-based optical flow datasets.

Contour Detection and Characterization for Asynchronous Event Sensors

Francisco Barranco, Ching L. Teo, Cornelia Fermüller, Yiannis Aloimonos
IEEE International Conference on Computer Vision (ICCV), 2015. 

Paper Abstract Project page
The bio-inspired, asynchronous, event-based dynamic vision sensor records temporal changes in the luminance of the scene at high temporal resolution. Since events are triggered only at significant luminance changes, most events occur at the boundaries of objects and their parts. The detection of these contours is an essential step for further interpretation of the scene. This paper presents an approach to learning the location of contours and their border ownership using Structured Random Forests on event-based features that encode motion, timing, texture, and spatial orientation. The classifier elegantly integrates information over time by utilizing previously computed classification results. Finally, the contour detection and boundary assignment are demonstrated in a layer segmentation of the scene. Experimental results demonstrate good performance in boundary detection and segmentation.

Bio-inspired Motion Estimation with Event-Driven Sensors

Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos
Advances in Computational Intelligence, 309-321, Springer International Publishing, 2015. 

Paper Abstract
This paper presents a method for image motion estimation for event-based sensors. Accurate and fast image flow estimation still challenges Computer Vision. A new paradigm based on asynchronous event-based data provides an interesting alternative and has been shown to provide good estimates at high-contrast contours by computing motion from very accurate timing. However, these techniques still fail in regions of high-frequency texture. This work presents a simple method for locating those regions and a novel phase-based method for event sensors that estimates motion more accurately in these regions. Finally, we evaluate and compare our results with other state-of-the-art techniques.
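The principle behind phase-based motion estimation can be shown on a toy 1D example: for a translating pattern I(x, t) = f(x - v t), the phase phi of the analytic signal satisfies v = -(dphi/dt) / (dphi/dx). The sketch below estimates velocity from a small spatio-temporal slice in this way; it is a frame-based toy, not the event-based formulation of the paper.

import numpy as np
from scipy.signal import hilbert

def phase_velocity_1d(patch, dx=1.0, dt=1.0):
    """Estimate the velocity of a 1D translating pattern from the phase of its analytic
    signal; patch is a (T, X) spatio-temporal slice."""
    analytic = hilbert(patch, axis=1)
    phase = np.unwrap(np.angle(analytic), axis=1)
    dphi_dx = np.gradient(phase, dx, axis=1)
    dphi_dt = np.gradient(np.unwrap(phase, axis=0), dt, axis=0)
    valid = np.abs(dphi_dx) > 1e-3            # discard unstable phase estimates
    return -np.median(dphi_dt[valid] / dphi_dx[valid])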

Contour Motion Estimation for Asynchronous Event-Driven Cameras

Francisco Barranco, Cornelia Fermüller, Yiannis Aloimonos
Proceedings of the IEEE, 102, 10, 1537-1556, 2014.

Paper Abstract
This paper compares image motion estimation with asynchronous event-based cameras to Computer Vision approaches that use frame-based video sequences as input. Since dynamic events are triggered at significant intensity changes, which often occur at the borders of objects, we refer to the event-based image motion as "contour motion." Algorithms are presented for the estimation of accurate contour motion from local spatio-temporal information for two camera models: the dynamic vision sensor (DVS), which asynchronously records temporal changes of the luminance, and a family of new sensors that combine DVS data with intensity signals. These algorithms take advantage of the high temporal resolution of the DVS and achieve robustness using a multiresolution scheme in time. It is shown that, because of the coupling of velocity and luminance information in the event distribution, the image motion estimation problem becomes much easier with the new sensors, which provide both events and image intensity, than with the DVS alone. Experiments on data synthesized from computer vision benchmarks show that our algorithm on the combined data outperforms computer vision methods in accuracy and can achieve real-time performance, and experiments on real data confirm the feasibility of the approach. Given that current image motion (or optic flow) methods cannot estimate well at object boundaries, the approach presented here could be used as a complement to optic flow techniques and can open new avenues for computer vision motion research.
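A common building block for contour motion from DVS data is a local fit of the event time surface: fitting a plane t ~ a*x + b*y + c to the timestamps of nearby events yields the normal flow at the contour, with direction (a, b) and speed 1/||(a, b)||. The sketch below shows this standard local estimate only; the algorithms in the paper add a multiresolution scheme in time and the combination with intensity data.

import numpy as np

def normal_flow_from_events(xs, ys, ts):
    """Fit a plane t ~ a*x + b*y + c to local event timestamps and return the normal flow
    vector, which points along (a, b) with magnitude 1/||(a, b)||, in pixels per unit time."""
    A = np.stack([xs.astype(float), ys.astype(float), np.ones(len(xs))], axis=1)
    (a, b, _), *_ = np.linalg.lstsq(A, ts.astype(float), rcond=None)
    g2 = a * a + b * b
    if g2 < 1e-12:
        return None                    # nearly simultaneous events: estimate unreliable
    return np.array([a / g2, b / g2])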