In order to navigate successfully in its environment and understand the visible world, a visual system must possess a set of basic capabilities. This thesis describes the design and development of the processes responsible for the estimation of egomotion (the system's own motion) and object motion, which are prerequisites for the accomplishment of any other navigational task. For a monocular observer capable of actively controlling the geometric parameters of its sensory apparatus, it is shown how different activities facilitate the interpretation of visual motion. The basic idea of the object motion estimation strategy lies in the employment of fixation and tracking. Fixation simplifies much of the computation by placing the object at the center of the visual field, and the main advantage of tracking is the accumulation of information over time. For the problem of egomotion estimation, new constraints of a global nature relating 2-D image measurements to 3-D motion parameters are presented. Local image measurements form global patterns in the image plane, and the position of these patterns determines the 3-D motion parameters. The algorithms developed are provably robust because the constraints employed are global and qualitative; neither correspondence nor optical flow is utilized as input, but only the spatio-temporal derivatives of the image intensity function.

Our work on Active Vision has recently focused on the computational modelling of navigational tasks, where our investigations were guided by the idea of approaching vision for behavioral systems in the form of modules that are directly related to perceptual tasks. These studies led us to branch out in various directions and inquire into the problems that have to be addressed in order to obtain an overall understanding of perceptual systems. In this paper we present our views about the architecture of vision systems, about how to tackle the design and analysis of perceptual systems, and about promising future research directions. Our suggested approach to understanding behavioral vision, that is, to realizing the relationship between perception and action, builds on two earlier approaches, the Medusa philosophy [3] and the Synthetic approach [15]. The resulting framework calls for synthesizing an artificial vision system by studying vision competences of increasing complexity and at the same time pursuing the integration of the perceptual components with action and learning modules. We expect that Computer Vision research in the future will progress in tight collaboration with many other disciplines that are concerned with empirical approaches to vision, i.e., with the understanding of biological vision. Throughout the paper we describe biological findings that motivate computational arguments which we believe will influence studies of Computer Vision in the near future.

Image displacement fields (optical flow fields, stereo disparity fields, normal flow fields) due to rigid motion possess a global geometric structure which is independent of the scene in view. Motion vectors of certain lengths and directions are constrained to lie on the imaging surface at particular loci whose location and form depend solely on the 3D motion parameters. If optical flow fields or stereo disparity fields are considered, then equal vectors are shown to lie on conic sections. Similarly, for normal motion fields, equal vectors lie within regions whose boundaries also constitute conics. By studying various properties of these curves and regions and their relationships, a characterization of the structure of rigid motion fields is given. The goal of this paper is to introduce a concept underlying the global structure of image displacement fields. This concept gives rise to various constraints that could form the basis of algorithms for the recovery of visual information from multiple views.
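The conic claim for equal flow vectors can be checked numerically. The sketch below uses a standard pinhole model with one common sign convention (the conventions and all variable names are our assumptions, not the paper's notation): writing the flow as a translational part divided by depth plus a depth-independent rotational part, the locus where the flow can equal a fixed vector c for some depth is the zero set of a function whose cubic terms cancel, leaving an exactly quadratic (conic) equation.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0                              # focal length (assumed)
t = np.array([0.3, -0.2, 1.0])       # arbitrary translation
w = np.array([0.02, 0.01, -0.03])    # arbitrary rotation

def equal_flow_locus(x, y, c):
    """F(x, y) = 0 on the locus where the rigid flow can equal vector c.
    With flow = (A, B)/Z + (u_rot, v_rot), depth Z drops out of
    A*(c_v - v_rot) - B*(c_u - u_rot) = 0."""
    A = -f*t[0] + x*t[2]
    B = -f*t[1] + y*t[2]
    u_rot = x*y*w[0]/f - (f + x*x/f)*w[1] + y*w[2]
    v_rot = (f + y*y/f)*w[0] - x*y*w[1]/f - x*w[2]
    return A*(c[1] - v_rot) - B*(c[0] - u_rot)

# F is exactly quadratic in (x, y): a six-term quadratic fit is residual-free,
# so the locus F = 0 is a conic section.
x = rng.uniform(-1, 1, 200); y = rng.uniform(-1, 1, 200)
F = equal_flow_locus(x, y, c=np.array([0.1, 0.05]))
basis = np.stack([np.ones_like(x), x, y, x*x, x*y, y*y], axis=1)
coef, *_ = np.linalg.lstsq(basis, F, rcond=None)
assert np.max(np.abs(basis @ coef - F)) < 1e-10
```

The cancellation of the cubic terms is what makes the structure independent of the scene: depth appears only in the translational part, which is eliminated before the locus equation is formed.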

Geometric considerations suggest that the problem of estimating a system's three-dimensional (3D) motion from a sequence of images, which has puzzled researchers in the fields of Computational Vision and Robotics as well as the Biological Sciences, can be addressed as a pattern recognition problem. Information for constructing the relevant patterns is found in spatial arrangements or gratings, that is, aggregations of orientations along which retinal motion information is estimated. The exact form of the gratings is defined by the shape of the retina or imaging surface; for a planar retina they are radial lines, concentric circles, as well as elliptic and hyperbolic curves, while for a spherical retina they become longitudinal and latitudinal circles for various axes. Considering retinal motion information computed normal to these gratings, patterns are found that have encoded in their shape and location on the retina subsets of the 3D motion parameters. The importance of these patterns is first that they depend only on the 3D motion and not on the scene in view, thus providing globally a separation of the effects of 3D motion and scene structure on the image motion, and second that they are founded upon easily derivable image measurements: they do not utilize exact retinal motion measurements such as optical flow, but only the sign of image motion along a set of directions defined by the gratings. The computational theory presented in this paper explains how the self-motion of a system can be estimated by locating these patterns. We also conjecture that this theory or variations of it might be implemented in nature and call for experiments in the neurosciences.
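A minimal sketch of the pattern idea, restricted to the special case of pure translation on a planar retina (the pinhole conventions and function names below are our assumptions, not the paper's formulation): with positive depth, the flow points away from the focus of expansion, so the sign of the motion component along the radial grating centered at a correct candidate location is never negative, and a candidate search needs only these signs, not exact flow values.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0
N = 3000
x = rng.uniform(-0.5, 0.5, N); y = rng.uniform(-0.5, 0.5, N)
Z = rng.uniform(2.0, 10.0, N)            # arbitrary positive depths
t = np.array([0.2, -0.1, 1.0])           # forward translation (assumed)

u = (-f*t[0] + x*t[2]) / Z               # purely translational flow
v = (-f*t[1] + y*t[2]) / Z

def sign_violations(foe):
    # With positive depth, flow points away from the true focus of
    # expansion, so its component along the radial grating centered at a
    # correct candidate is never negative.
    radial = u*(x - foe[0]) + v*(y - foe[1])
    return int(np.sum(radial < 0))

foe_true = (f*t[0]/t[2], f*t[1]/t[2])
assert sign_violations(foe_true) == 0
assert sign_violations((0.6, 0.6)) > 0   # a wrong candidate breaks the pattern
```

The criterion depends only on the sign of image motion along the grating directions, which is the robustness property the abstract emphasizes; the full theory extends this to general motion and to the other grating families.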

The study of visual navigation problems requires the integration of visual processes with motor control processes. Most essential in approaching this integration is the study of appropriate spatio-temporal representations which the system computes from the imagery and which serve as interfaces to all cognitive and motor activities. Since representations resulting from exact quantitative reconstruction have turned out to be very hard to obtain, we argue here for the necessity of representations which can be computed easily, reliably and in real time, and which recover only the information about the 3D world that is really needed in order to solve the navigational problems at hand. In this paper we introduce a number of such representations capturing aspects of 3D motion and scene structure which are used for the solution of navigational problems implemented in visual servo systems.

If instead of the full motion field, we consider only the direction of the motion field due to a rigid motion, what can we say about the three-dimensional motion information contained in it? This paper provides a geometric analysis of this question based solely on the constraint that the depth of the surfaces in view is positive.

It is shown that if the whole sphere is considered as the imaging surface, then, independently of the scene in view, two different rigid motions cannot give rise to the same directional motion field. If we restrict the image to half of a sphere (or an infinitely large image plane), two different rigid motions with instantaneous translational and rotational velocities **(t1, omega1)** and **(t2, omega2)** cannot give rise to the same directional motion field unless the plane through **t1** and **t2** is perpendicular to the plane through **omega1** and **omega2** (i.e., **(t1 x t2) . (omega1 x omega2) = 0**). In addition, in order to give practical significance to these uniqueness results for the case of a limited field of view, we also characterize the locations on the image where the motion vectors due to the two different motions must have different directions.

If **(omega1 x omega2) . (t1 x t2) = 0** and certain additional constraints are met, then the two rigid motions could produce motion fields with the same direction. For this to happen, the depth of each corresponding surface has to lie within a certain range, bounded by a second-order and a third-order surface. Finally, as a byproduct of the analysis, it is shown that if we also consider the constraint of positive depth, the full motion field on a half sphere uniquely constrains the 3D motion, independently of the scene in view.

A sequence of images acquired by a moving sensor contains information about the three-dimensional motion of the sensor and the shape of the imaged scene. Interesting research during the past few years has attempted to characterize the errors that arise in computing 3D motion (egomotion estimation) as well as the errors that result in the estimation of the scene's structure (structure from motion). Previous research is characterized by the use of optic flow or correspondence of features in the analysis as well as by the employment of particular algorithms and models of the scene in recovering expressions for the resulting errors. This paper presents a geometric framework that characterizes the relationship between 3D motion and shape when they are both corrupted by errors. We examine how the three-dimensional space recovered by a moving monocular observer, whose 3D motion is estimated with some error, is distorted. We characterize the space of distortions by its level sets, that is, we characterize the systematic distortion via a family of iso-distortion surfaces, each of which describes the locus over which the depths of points in the scene in view are distorted by the same multiplicative factor. The framework introduced in this way has a number of applications: Since the visible surfaces have positive depth (visibility constraint), by analyzing the geometry of the regions where the distortion factor is negative, that is, where the visibility constraint is violated, we make explicit situations which are likely to give rise to ambiguities in motion estimation, independent of the algorithm used. We provide a uniqueness analysis for 3D motion analysis from normal flow. We study the constraints on egomotion, object motion and depth for an independently moving object to be detectable by a moving observer, and we offer a quantitative account of the precision needed in an inertial sensor for accurate estimation of 3D motion.
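The iso-distortion idea can be made concrete with a small numerical sketch. We assume a standard pinhole model with one common sign convention (the flow equations, the direction vector n, and all names below are our assumptions, not the paper's derivation): depth recovered from the component of the true flow along a direction n, using an erroneous motion estimate, equals the true depth times a distortion factor D; the iso-distortion surfaces are the level sets of D, and D < 0 marks a violation of the visibility constraint.

```python
import numpy as np

f = 1.0  # focal length (assumed)

def flow(x, y, Z, t, w):
    # Rigid-motion image velocity under a pinhole model
    # (one common sign convention; the paper's own derivation may differ).
    u = (-f*t[0] + x*t[2])/Z + x*y*w[0]/f - (f + x*x/f)*w[1] + y*w[2]
    v = (-f*t[1] + y*t[2])/Z + (f + y*y/f)*w[0] - x*y*w[1]/f - x*w[2]
    return np.array([u, v])

def distortion_factor(x, y, Z, t, w, t_hat, w_hat, n):
    """Depth recovered from the component of the true flow along the unit
    direction n, using the (possibly wrong) estimate (t_hat, w_hat),
    divided by the true depth.  D = 1 for the correct motion; D < 0
    violates the visibility constraint."""
    u_true = flow(x, y, Z, t, w)
    u_tr = np.array([-f*t_hat[0] + x*t_hat[2], -f*t_hat[1] + y*t_hat[2]])
    u_rot = np.array([x*y*w_hat[0]/f - (f + x*x/f)*w_hat[1] + y*w_hat[2],
                      (f + y*y/f)*w_hat[0] - x*y*w_hat[1]/f - x*w_hat[2]])
    Z_hat = np.dot(u_tr, n) / np.dot(u_true - u_rot, n)
    return Z_hat / Z

t = np.array([1.0, 0.0, 0.5]); w = np.array([0.01, 0.02, 0.0])
n = np.array([1.0, 1.0]) / np.sqrt(2)
# The correct motion estimate distorts nothing (D = 1):
assert abs(distortion_factor(0.2, 0.1, 3.0, t, w, t, w, n) - 1.0) < 1e-9
```

Sweeping (x, y, Z) at fixed motion errors and collecting points of equal D traces out one iso-distortion surface of the family described in the abstract.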

We are surrounded by surfaces that we perceive by visual means. Understanding the basic principles behind this perceptual process is a central theme in visual psychology, psychophysics and computational vision. In many of the computational models employed in the past, it has been assumed that a metric representation of physical space can be derived by visual means. Psychophysical experiments, as well as computational considerations, can convince us that the perception of space and shape has a much more complicated nature, and that only a distorted version of actual, physical space can be computed. This paper develops a computational geometric model that explains why such distortion might take place. The basic idea is that, both in stereo and motion, we perceive the world from multiple views. Given the rigid transformation between the views and the properties of the image correspondence, the depth of the scene can be obtained. Even a slight error in the rigid transformation parameters causes distortion of the computed depth of the scene. The unified framework introduced here describes this distortion in computational terms. We characterize the space of distortions by its level sets, that is, we characterize the systematic distortion via a family of iso-distortion surfaces, each of which describes the locus over which depths are distorted by the same multiplicative factor. Given that humans' estimation of egomotion or estimation of the extrinsic parameters of the stereo apparatus is likely to be imprecise, the framework is used to explain a number of psychophysical experiments on the perception of depth from motion or stereo.

If 3D rigid motion can be correctly estimated from image sequences, the structure of the scene can be correctly derived using the equations for image formation. However, an error in the estimation of 3D motion will result in the computation of a distorted version of the scene structure. Of computational interest are those regions in space where the distortions are such that the depths become negative, because in order for the scene to be visible it has to lie in front of the image, and thus the corresponding depth estimates have to be positive. The stability analysis for the structure from motion problem presented in this paper investigates the optimal relationship between the errors in the estimated translational and rotational parameters of a rigid motion that results in the estimation of a minimum number of negative depth values. The input used is the value of the flow along some direction, which is more general than optic flow or correspondence. For a planar retina it is shown that the optimal configuration is achieved when the projections of the translational and rotational errors on the image plane are perpendicular. Furthermore, the projections of the actual and the estimated translation lie on a line through the image center. For a spherical retina, given a rotational error, the optimal translation is the correct one, while given a translational error the optimal rotational error is normal to the translational one at an equal distance from the real and estimated translations. The proofs, besides illuminating the confounding of translation and rotation in structure from motion, have an important application to ecological optics.
The same analysis provides a computational explanation of why it is easier to estimate self-motion in the case of a spherical retina and why it is easier to estimate shape in the case of a planar retina, thus suggesting that nature's design of compound eyes (or panoramic vision) for flying systems and camera-type eyes for primates (and other systems that perform manipulation) is optimal.

A computational explanation of the illusory movement experienced upon extended viewing of *Enigma*, a static figure painted by Leviant, is presented. The explanation relies on a model for the interpretation of three-dimensional motion information contained in retinal motion measurements. This model shows that the *Enigma* figure is a special case of a larger class of figures exhibiting the same illusory movement, and these figures are introduced here. Our explanation supports the view that higher-level processes are the cause of these illusions [Zeki et al. 1993] and suggests a number of new experiments to unravel the functional structure of the motion pathway.

In the literature we find two classes of algorithms which, on the basis of two views of a scene, recover the rigid transformation between the views and subsequently the structure of the scene. The first class contains techniques which require knowledge of the correspondence or the motion field between the images and are based on the epipolar constraint. The second class contains so-called direct algorithms which require knowledge about the value of the flow in one direction only and are based on the positive depth constraint. Algorithms in the first class achieve the solution by minimizing a function representing deviation from the epipolar constraint while direct algorithms find the 3D motion that, when used to estimate depth, produces a minimum number of negative depth values. This paper presents a stability analysis of both classes of algorithms. The formulation is such that it allows comparison of the robustness of algorithms in the two classes as well as within each class. Specifically, a general statistical model is employed to express the functions which measure the deviation from the epipolar constraint and the number of negative depth values, and these functions are studied with regard to their topographic structure, specifically as regards the errors in the 3D motion parameters at the places representing the minima of the functions. The analysis shows that for algorithms in both classes which estimate all motion parameters simultaneously, the obtained solution has an error such that the projections of the translational and rotational errors on the image plane are perpendicular to each other. Furthermore, the estimated projection of the translation on the image lies on a line through the origin and the projection of the real translation. In the case of algorithms based on the epipolar constraint the solution for translation is biased toward the center of the image while in the case of direct algorithms it is biased away from the center. 
The analysis also makes it possible to compare properties of algorithms that first estimate the translation and on the basis of the translational result estimate the rotation, algorithms that do the opposite, and algorithms that estimate all motion parameters simultaneously. The framework introduced here, besides serving as a tool that enables the robustness analysis of 3D motion estimation algorithms, provides a formal relationship between a large number of published motion estimation algorithms.
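The criterion used by the direct class of algorithms can be illustrated with a small simulation (a sketch under an assumed pinhole model and sign convention; the scene, motions, and helper names are ours): depth recovered from the flow under the correct motion is everywhere positive, while a wrong candidate motion produces impossible negative depth values, so minimizing the negative-depth count over candidate motions selects the right one.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0
N = 2000
x = rng.uniform(-0.5, 0.5, N); y = rng.uniform(-0.5, 0.5, N)
Z = rng.uniform(2.0, 10.0, N)           # a random visible (positive-depth) scene

t_true = np.array([1.0, 0.0, 0.3]); w_true = np.array([0.0, 0.02, 0.0])
# Horizontal component of the rigid flow generated by the true motion:
u = (-f*t_true[0] + x*t_true[2])/Z \
    + x*y*w_true[0]/f - (f + x*x/f)*w_true[1] + y*w_true[2]

def negative_depths(t_hat, w_hat):
    # Depth recovered from the horizontal flow component under a candidate
    # motion; the criterion is the number of (impossible) negative values.
    u_rot = x*y*w_hat[0]/f - (f + x*x/f)*w_hat[1] + y*w_hat[2]
    Z_hat = (-f*t_hat[0] + x*t_hat[2]) / (u - u_rot)
    return int(np.sum(Z_hat < 0))

assert negative_depths(t_true, w_true) == 0                      # correct motion
assert negative_depths(np.array([-1.0, 0.0, 0.3]), w_true) > 0   # wrong motion
```

Only the value of the flow along one direction is used here, matching the input assumption of the direct algorithms described above; the epipolar-based class instead scores candidates by deviation from the epipolar constraint on full correspondences.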

This study investigates the problem of estimating camera calibration parameters from image motion fields induced by a rigidly moving camera with unknown parameters, where the image formation is modeled with a linear pinhole-camera model. The equations obtained show that the flow separates into a component due to the translation and the calibration parameters and a component due to the rotation and the calibration parameters. A set of parameters encoding the latter component is linearly related to the flow, and from these parameters the calibration can be determined.

However, as for discrete motion, in general it is not possible to decouple image measurements obtained from only two frames into translational and rotational components. Geometrically, the ambiguity takes the form of a part of the rotational component being parallel to the translational component, and thus the scene can be reconstructed only up to a projective transformation. In general, for full calibration at least four successive image frames are necessary, with the 3D rotation changing between the measurements.

This geometric analysis gives rise to a direct self-calibration method that avoids computation of optical flow or point correspondences and uses only normal flow measurements. In this technique the direction of translation is estimated using smoothness constraints in a novel way. The calibration parameters are then estimated from the rotational components of several flow fields using Levenberg-Marquardt parameter estimation, iterative in the calibration parameters only. The proposed technique does not require calibration objects in the scene or special camera motions and it also avoids the computation of exact correspondence. This makes it suitable for the calibration of active vision systems which have to acquire knowledge about their intrinsic parameters while they perform other tasks, or as a tool for analyzing image sequences in large video databases.
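The linearity exploited above can be illustrated in the calibrated special case (a sketch, not the paper's uncalibrated formulation: we assume the focal length is known, so the unknowns reduce to the rotation itself): each component of the depth-independent rotational flow is linear in the rotation parameters with known functions of image position as weights, so the parameters follow from linear least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0
x = rng.uniform(-0.5, 0.5, 500); y = rng.uniform(-0.5, 0.5, 500)

def rotational_basis(x, y):
    # Each component of the rotational (depth-independent) flow is linear
    # in (wx, wy, wz); the weights are known functions of image position.
    A_u = np.stack([x*y/f, -(f + x*x/f), y], axis=1)
    A_v = np.stack([f + y*y/f, -x*y/f, -x], axis=1)
    return np.vstack([A_u, A_v])

w_true = np.array([0.01, -0.02, 0.005])
A = rotational_basis(x, y)
measured = A @ w_true + rng.normal(0.0, 1e-4, A.shape[0])  # noisy rotational flow
w_est, *_ = np.linalg.lstsq(A, measured, rcond=None)
assert np.allclose(w_est, w_true, atol=1e-3)
```

In the uncalibrated setting of the paper, the calibration parameters multiply these basis functions, which is why several flow fields with changing rotation are needed before the nonlinear Levenberg-Marquardt step can isolate them.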

Since estimation of camera motion requires knowledge of independent motion, and moving object detection and localization requires knowledge about the camera motion, the two problems of motion estimation and segmentation need to be solved together in a synergistic manner. This paper provides an approach to treating both these problems for a binocular observer. The technique introduced here is based on a novel concept, ``scene smoothness,'' which parameterizes the variation in estimated scene depth with the error in the underlying 3D motion. The idea is that incorrect 3D motion estimates cause distortions in the estimated depth map, and as a result smooth scene patches are computed as unsmooth, i.e., rugged, surfaces. The correct 3D motion can be distinguished, as it does not cause any distortion and thus gives rise to the smoothest background patches, with the locations corresponding to independent motion being unsmooth. The observer's binocular nature is exploited in the extraction of depth discontinuities, a step that facilitates the overall procedure.
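The scene-smoothness idea can be sketched numerically (under an assumed pinhole model and sign convention; the planar test scene, the roughness measure, and all names are our illustrative choices): inverting the flow of a smooth patch with the correct motion reproduces a smooth depth map, while a wrong motion candidate turns the same patch into a rugged surface.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0
x = rng.uniform(-0.5, 0.5, 2000); y = rng.uniform(-0.5, 0.5, 2000)
Z = 5.0 + 0.5*x + 0.3*y                  # a smooth (planar) background patch

t = np.array([1.0, 0.0, 0.2]); w = np.array([0.0, 0.01, 0.0])
# Horizontal flow component generated by the true motion:
u = (-f*t[0] + x*t[2])/Z + x*y*w[0]/f - (f + x*x/f)*w[1] + y*w[2]

def roughness(t_hat, w_hat):
    """RMS residual of the recovered depth map after fitting a plane:
    near zero when the candidate motion is correct, large when a wrong
    motion distorts the smooth patch into a rugged surface."""
    u_rot = x*y*w_hat[0]/f - (f + x*x/f)*w_hat[1] + y*w_hat[2]
    Z_hat = (-f*t_hat[0] + x*t_hat[2]) / (u - u_rot)
    A = np.stack([x, y, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(A, Z_hat, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - Z_hat)**2)))

w_wrong = w + np.array([0.05, 0.0, 0.0])
assert roughness(t, w) < 1e-8 < roughness(t, w_wrong)
```

In the paper the binocular depth discontinuities delimit the patches over which such a smoothness measure is evaluated, so that independently moving regions stand out as the unsmooth residue under the best background motion.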

A pattern by Ouchi (Figure 1) has the surprising property that small motions can cause illusory relative motion between the inset and background regions. The effect can be attained with small retinal motions or a slight jiggling of the paper and is robust over large changes in the patterns, frequencies and boundary shapes. In this paper, we explain that the cause of the illusion lies in the statistical difficulty of integrating local one-dimensional motion signals into two-dimensional image velocity measurements. The estimation of image velocity generally is biased, and for the particular spatial gradient distributions of the Ouchi pattern the bias is highly pronounced, giving rise to a large difference in the velocity estimates in the two regions. The computational model introduced to describe the statistical estimation of image velocity also accounts for the findings of psychophysical studies with variations of the Ouchi pattern and for various findings on the perception of moving plaids. The insight gained from this computational study challenges the current models used to explain biological vision systems and to construct robotic vision systems. Considering the statistical difficulties in image velocity estimation in conjunction with the problem of discontinuity detection in motion fields suggests that theoretically the process of optical flow computation should not be carried out in isolation but in conjunction with the higher-level processes of 3D motion estimation, segmentation and shape computation.

Natural or artificial vision systems process the images that they collect with their eyes or cameras in order to derive information for performing tasks related to navigation and recognition. Since the way images are acquired determines how difficult it is to perform a visual task, and since systems have to cope with limited resources, the eyes used by a specific system should be designed to optimize subsequent image processing as it relates to particular tasks. Different ways of sampling light, i.e., different eyes, may be less or more powerful with respect to particular competences. This seems intuitively evident in view of the variety of eye designs in the biological world. It is shown here that a spherical eye (an eye or system of eyes providing panoramic vision) is superior to a camera-type eye (an eye with restricted field of view) as regards the competence of three-dimensional motion estimation. This result is derived from a statistical analysis of all the possible computational models that can be used for estimating 3D motion from an image sequence. The findings explain biological design in a mathematical manner, by showing that systems that fly and thus need good estimates of 3D motion gain advantages from panoramic vision. Also, insights obtained from this study point to new ways of constructing powerful imaging devices that suit particular tasks in robotics, visualization and virtual reality better than conventional cameras, thus leading to a new camera technology.

When processing image sequences some representation of image motion must be derived as a first stage. The most often used such representation is the optical flow field, which is a set of velocity measurements of image patterns. It is well known that it is very difficult to estimate accurate optical flow at locations in an image which correspond to scene discontinuities. What is less well known, however, is that even at the locations corresponding to smooth scene surfaces, the optical flow field often cannot be estimated accurately.

Noise in the data causes many optical flow estimation techniques to give biased flow estimates. Very often there is a consistent bias: the estimate tends to be an underestimate in length and to point in a direction closer to the majority of the gradients in the patch. This paper studies all three major categories of flow estimation methods (gradient-based, energy-based, and correlation methods) and analyzes different ways of compounding one-dimensional motion estimates (image gradients, spatiotemporal frequency triplets, local correlation estimates) into two-dimensional velocity estimates, including linear and nonlinear methods.

Correcting for the bias would require knowledge of the noise parameters. In many situations, however, these are difficult to estimate accurately, as they change with the dynamic imagery in unpredictable and complex ways. Thus, the bias really is a problem inherent to optical flow estimation. We argue that the bias is also integral to the human visual system. It is the cause of the illusory perception of motion in the Ouchi pattern (Figure 1) and also explains various psychophysical studies of the perception of moving plaids.
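The underestimation can be reproduced in a few lines for the gradient-based case (a sketch with synthetic data; the noise level, gradient distribution, and names are our assumptions): solving the brightness-constancy constraints by least squares when the measured gradients themselves are noisy is an errors-in-variables problem, and the resulting attenuation shortens the velocity estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
u_true = np.array([1.0, 0.0])            # true image velocity
n = 5000
theta = rng.normal(0.0, 0.3, n)          # gradients clustered near one orientation
g = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # true unit gradients
b = g @ u_true                           # brightness constancy: g . u = -g_t

g_noisy = g + rng.normal(0.0, 0.3, g.shape)   # noise in the measured gradients
# Lucas-Kanade-style least-squares velocity from the noisy constraints:
u_hat = np.linalg.lstsq(g_noisy, b, rcond=None)[0]

# Errors in the explanatory variables bias the estimate toward shorter vectors.
assert np.linalg.norm(u_hat) < np.linalg.norm(u_true)
```

Because the attenuation depends on the spatial gradient distribution, two adjacent regions with different gradient statistics, as in the Ouchi pattern, yield systematically different velocity estimates for the same physical motion.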

Models of real-world objects and actions for use in graphics, virtual and augmented reality and related fields can only be obtained through the use of visual data and particularly video. This paper examines the question of recovering shape models from video information. Given video of an object or a scene captured by a moving camera, a prerequisite for model building is to recover the three-dimensional (3D) motion of the camera which consists of a rotation and a translation at each instant. It is shown here that a spherical eye (an eye or system of eyes providing panoramic vision) is superior to a camera-type eye (an eye with restricted field of view such as a common video camera) as regards the competence of 3D motion estimation. This result is derived from a geometric/statistical analysis of all the possible computational models that can be used for estimating 3D motion from an image sequence. Regardless of the estimation procedure for a camera-type eye, the parameters of the 3D rigid motion (translation and rotation) contain errors satisfying specific geometric constraints. Thus, translation is always confused with rotation, resulting in inaccurate results. This confusion does not happen for the case of panoramic vision. Insights obtained from this study point to new ways of constructing powerful imaging devices that suit particular tasks in visualization and virtual reality better than conventional cameras, thus leading to a new camera technology. Such new eyes are constructed by putting together multiple existing video cameras in specific ways, thus obtaining eyes from eyes. For a new eye of this kind we describe an implementation for deriving models of scenes from video data, while avoiding the correspondence problem in the video sequence.

This paper examines the inherent difficulties in observing 3D rigid motion from image sequences. It does so without considering a particular estimator. Instead, it presents a statistical analysis of all the possible computational models which can be used for estimating 3D motion from an image sequence. These computational models are classified according to the mathematical constraints that they employ and the characteristics of the imaging sensor (restricted field of view and full field of view). Regarding the mathematical constraints, there exist two principles relating a sequence of images taken by a moving camera. One is the ``epipolar constraint,'' applied to motion fields, and the other the ``positive depth'' constraint, applied to normal flow fields. 3D motion estimation amounts to optimizing these constraints over the image. A statistical modeling of these constraints leads to functions which are studied with regard to their topographic structure, specifically as regards the errors in the 3D motion parameters at the places representing the minima of the functions. For conventional video cameras possessing a restricted field of view, the analysis shows that for algorithms in both classes which estimate all motion parameters simultaneously, the obtained solution has an error such that the projections of the translational and rotational errors on the image plane are perpendicular to each other. Furthermore, the estimated projection of the translation on the image lies on a line through the origin and the projection of the real translation. The situation is different for a camera with a full (360 degree) field of view (achieved by a panoramic sensor or by a system of conventional cameras). In this case, at the locations of the minima of the above two functions, either the translational or the rotational error becomes zero, while in the case of a restricted field of view both errors are non-zero. 
Although some ambiguities still remain in the full field of view case, the implication is that visual navigation tasks, such as visual servoing, involving 3D motion estimation are easier to solve by employing panoramic vision. Also, the analysis makes it possible to compare properties of algorithms that first estimate the translation and on the basis of the translational result estimate the rotation, algorithms that do the opposite, and algorithms that estimate all motion parameters simultaneously, thus providing a sound framework for the observability of 3D motion. Finally, the introduced framework points to new avenues for studying the stability of image-based servoing schemes.

It is proposed in this paper that many geometrical optical illusions, as well as illusory patterns due to motion signals in line drawings, are due to the statistics of visual computations. The interpretation of image patterns is preceded by a step where image features such as lines, intersections of lines, or local image movement must be derived. However, there are many sources of noise or uncertainty in the formation and processing of images, and they cause problems in the estimation of these features; in particular, they cause bias. As a result the locations of features are perceived erroneously and the appearance of the patterns is altered. The bias occurs with any visual processing of line features; under average conditions it is not large enough to be noticeable, but illusory patterns are such that the bias is highly pronounced. Thus the broader message of this paper is that there is a general principle which governs the workings of vision systems, and optical illusions are an artifact of this principle.

We investigate the relationship between camera design and the problem of recovering the motion and structure of a scene from video data. The visual information that could possibly be obtained is described by the plenoptic function. A camera can be viewed as a device that captures a subset of this function, that is, it measures some of the light rays in some part of the space. The information contained in the subset determines how difficult it is to solve subsequent interpretation processes. By examining the differential structure of the time-varying plenoptic function we relate different known and new camera models to the spatio-temporal structure of the observed scene. This allows us to define a hierarchy of camera designs, where the order is determined by the stability and complexity of the computations necessary to estimate structure and motion. At the low end of this hierarchy is the standard planar pinhole camera, for which the structure from motion problem is non-linear and ill-posed. At the high end is a camera, which we call the full field of view plenoptic camera, for which the problem is linear and stable. In between are multiple view cameras with a large field of view which we have built, as well as catadioptric panoramic sensors and other omni-directional cameras. We develop design suggestions for the plenoptic camera, and based upon this new design we propose a linear algorithm for ego-motion estimation, which in essence combines differential motion estimation with differential stereo.