# The Conventional Wisdom

In the bulk of the literature the problem is still studied in the same way that photogrammetrists first approached it at the beginning of the century. Given two positions of the eye or camera, there are two concepts of interest: (a) the 3D rigid transformation relating the coordinate systems of the two viewing positions, consisting of the composition of a rotation and a translation; and (b) the 2D transformation relating the images. The second is usually taken to be the correspondence between features in the two images that are projections of the same feature in the scene or, equivalently, the velocity with which image points move, the "motion field" (whose estimate is referred to as optic flow). Given the correspondence or flow, the 3D transformation can be computed, after which finding a model of the scene is easy. The mathematics of this approach was worked out by Longuet-Higgins, by Huang and his group, and by J.K. Aggarwal and his group for two views and point correspondences, and by Spetsakis and Aloimonos for three views and correspondences of points or lines; additional views provide no further geometric information for a static scene. Koenderink, Faugeras, Sparr, Maybank, Hartley, and many others provided insight into computing affine or projective models in this framework and developed the mathematics further. Koenderink and van Doorn contributed the most useful, fundamental insights into the relationship between image motion, 3D motion, and shape.
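The two-view mathematics referred to above rests on the epipolar constraint: for calibrated cameras, corresponding rays satisfy $x_2^\top E\, x_1 = 0$, where the essential matrix $E = [t]_\times R$ encodes the rigid transformation. A minimal sketch, in the spirit of Longuet-Higgins' linear "eight-point" method (the function name and the NumPy formulation are ours, not taken from the cited works):

```python
import numpy as np

def essential_from_correspondences(x1, x2):
    """Linear (eight-point-style) estimate of the essential matrix E
    satisfying x2_i^T E x1_i = 0, for normalized homogeneous image
    coordinates x1, x2 of shape (N, 3), N >= 8."""
    # Each correspondence gives one linear equation in the 9 entries
    # of E (row-major); kron(x2, x1) is exactly that coefficient row.
    A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)          # null vector = vec(E) up to scale
    # Project onto the essential-matrix manifold: singular values
    # of a true E are (s, s, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```

Given $E$, rotation and translation (the latter up to scale) are recovered from its factorization, and structure then follows by triangulation — the clean flow-to-motion-to-structure pipeline the text describes.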

In computer vision, a characteristic of the correspondence-based approach is a clear separation between structure and motion computation, and between 2D and 3D information. Usually, 2D smoothing constraints are first employed to obtain the optical flow field or correspondence from the image measurements; this information is then used to estimate 3D motion and, subsequently, structure. The trouble with such an approach is that optical flow or correspondence cannot be computed well from image measurements alone, and erroneously computed optical flow subsequently leads to errors in 3D motion and structure.

There are basically two problems. The first arises from the locations of flow discontinuities, which are due to scene elements at different depths or to differently moving objects. If we knew where the discontinuities were, we could estimate flow values for image patches corresponding to smooth scene patches using any of a multitude of approaches based on smoothness constraints; but knowing the discontinuities requires first solving for motion and structure (a chicken-and-egg problem). The second problem is statistical in nature: even within areas corresponding to smooth scene patches, optical flow cannot be estimated accurately; the estimate is biased, and the bias depends on the gradient distribution of the scene texture. This bias is highly pronounced in the pattern designed by Ouchi and explained in our recent work: slight movements of the pattern produce different apparent movements in the inset and the background, an example where accurate flow is impossible to compute. The correspondence-based framework has given rise to some applications, especially ones involving well-structured geometric objects or semi-automatic techniques (for example, use of a human operator), but it is approaching its limits. Treating an image sequence as a moving cloud of points has its limitations.
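The statistical difficulty can be seen in the standard least-squares, gradient-based flow estimator (Lucas-Kanade style) that such pipelines typically use. A minimal sketch, with a hypothetical striped texture of our own choosing (not the Ouchi pattern itself): when the texture's gradients all point one way, the estimate collapses onto the gradient direction regardless of the true motion.

```python
import numpy as np

def patch_flow(I1, I2):
    """Least-squares flow (u, v) for one patch, from the
    brightness-constancy linearization Ix*u + Iy*v + It = 0."""
    Ix = np.gradient(I1, axis=1)   # horizontal spatial gradient
    Iy = np.gradient(I1, axis=0)   # vertical spatial gradient
    It = I2 - I1                   # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    return np.linalg.lstsq(A, -It.ravel(), rcond=None)[0]

# Vertical stripes: every gradient points along x.  Shift the patch
# diagonally, so the true flow is (1, 1); only the component along
# the gradient is observable, and the estimate is pulled to (1, 0).
x = np.arange(40)
stripes = np.tile(np.sin(2 * np.pi * x / 20.0), (40, 1))
moved = np.roll(np.roll(stripes, 1, axis=0), 1, axis=1)
u, v = patch_flow(stripes, moved)   # u near 1, v near 0
```

The same normal equations show why the bias depends on the gradient distribution: the matrix $A^\top A$ is built from the texture's gradient statistics, and anisotropic or skewed gradient distributions tilt the solution even inside perfectly smooth scene patches.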