My research on Active Vision has concentrated on:
(a) the problem of segmenting a scene into different surfaces through the integration of visual cues. This is the heart of the vision processes underlying the perception of space and objects.
(b) the interpretation and understanding of action.
(a) COMPOSITIONAL VISION
Chicken-and-egg problems in early and middle-level vision. The visual world can be decomposed into a collection of surfaces separated by boundaries. Such a surface segmentation may be performed with respect to a property (or set of properties) that varies smoothly along a surface but is discontinuous at the boundary between two surfaces. Properties like the brightness or color of image pixels are sensor
measurements, and we can imagine a process that finds discontinuities in these properties. However, segmenting an image using properties such as depth, motion, or texture is not directly possible, since these properties are not direct sensor measurements but quantities to be deduced. Deducing these quantities, in turn, depends on the very segmentation they would create; therefore, such property estimation and property-based segmentation problems form chicken-and-egg pairs.
Some examples are: texture and texture-based segmentation, disparity and disparity-based segmentation from stereo images, optical flow and 2D motion-based segmentation from video, 3D motion and 3D motion-based segmentation from video.
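To make the circularity concrete, here is a minimal toy sketch of one such pair, written as an alternation between property estimation and property-based segmentation on a 1D signal. The data, function names, thresholds, and the simple alternation scheme are my own illustration, not the algorithm used in our work.

```python
# A minimal, illustrative sketch (not our actual method) of one chicken-and-egg
# pair: a property (say, disparity) should be smoothed within each surface, but
# the surfaces are only known once the property has been estimated.  One way to
# picture this is as an alternation between the two estimates.
import numpy as np

def estimate_property(noisy, segmentation, radius=1):
    """Smooth the measurements, averaging only over neighbours in the same segment."""
    estimate = np.empty_like(noisy)
    for i in range(len(noisy)):
        lo, hi = max(0, i - radius), min(len(noisy), i + radius + 1)
        same = segmentation[lo:hi] == segmentation[i]
        estimate[i] = noisy[lo:hi][same].mean()
    return estimate

def segment_property(estimate, threshold=0.5):
    """Place a boundary wherever the estimated property jumps."""
    jumps = np.abs(np.diff(estimate)) > threshold
    return np.concatenate([[0], np.cumsum(jumps)])   # one segment label per sample

def joint_solution(noisy, n_iters=5):
    """Alternate: the property needs the segments, the segments need the property."""
    segmentation = np.zeros(len(noisy), dtype=int)    # start with a single segment
    for _ in range(n_iters):
        estimate = estimate_property(noisy, segmentation)
        segmentation = segment_property(estimate)
    return estimate, segmentation

# Toy 1D "scene": two surfaces at property values 1.0 and 3.0, plus sensor noise.
rng = np.random.default_rng(0)
noisy = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)])
noisy += 0.1 * rng.standard_normal(noisy.size)
estimate, segmentation = joint_solution(noisy)   # settles on two clean segments
```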
In fact, the interdependence among the various subproblems is not restricted to a few chicken-and-egg pairs but is ubiquitous, as indicated by the web of relationships in the figure above, which describes the flow of information among the early modules related to the correspondence problem. The architecture underlying this figure has been pieced together from our work over the past several years as well as a number of results from the community. The PhD dissertation of Abhijit Ogale and the results of
Cornelia Fermuller on Uncertainty in Visual Processes were important components of this architecture.
First, various measurements such as color, intensity, and filter responses are gathered from the input images. The filter responses can be used as inputs to a texture analysis module, as well as local evidence for the binocular stereo and motion correspondence modules. Stereo disparity and optical flow both affect and are affected by the estimation of depth discontinuities, shape and depth; therefore, stereo and optical flow interact indirectly with each other. Shape from X (texture, shading) also influences the estimation of 3D shape. The estimation of depth discontinuities is also affected in certain directions by image discontinuities, such as intensity, color or texture discontinuities. Image discontinuities may in turn affect the perception of intensity, color and texture as well, as indicated by certain visual illusions. Besides depth discontinuities, optical flow is also related to 3D motion discontinuities (i.e., boundaries of independently moving objects) and to 3D motion estimation. Ultimately, the computed shape feeds back to affect the interpretation of the local measurements and samples, and we return to where we started!
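Since the figure itself is not reproduced here, the sketch below simply transcribes the interactions just described as a directed graph in Python. The module names and edges are my reading of the paragraph above, for illustration only; they are not a complete or authoritative rendering of the architecture.

```python
# Illustrative transcription of the flow of information among early-vision
# modules, taken from the textual description above (edges are approximate).
influences = {
    "image measurements": ["texture", "stereo disparity", "optical flow"],
    "texture": ["image discontinuities", "3D shape"],
    "shading": ["3D shape"],
    "stereo disparity": ["depth discontinuities", "depth", "3D shape"],
    "optical flow": ["depth discontinuities", "3D motion", "3D motion discontinuities"],
    "image discontinuities": ["depth discontinuities", "perception of intensity/color/texture"],
    "depth discontinuities": ["stereo disparity", "optical flow"],
    "3D motion": ["optical flow"],
    "3D motion discontinuities": ["optical flow"],
    "3D shape": ["image measurements"],   # the feedback loop closing the circle
}

def downstream(module, graph=influences, seen=None):
    """Every module that `module` can influence, directly or indirectly."""
    seen = set() if seen is None else seen
    for target in graph.get(module, []):
        if target not in seen:
            seen.add(target)
            downstream(target, graph, seen)
    return seen

# Because of the feedback loop, the image measurements eventually influence
# everything, including their own interpretation.
print(sorted(downstream("image measurements")))
```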
The above scenario can be expanded to contain many more modules, interactions and dependencies than those just described. To solve such chicken-and-egg problems, a compositional approach is required. Compositionality implies seeking joint solutions to interdependent problems, possibly using explicit or implicit feedback; it implies a principle of gradual collaborative commitment to solutions and prevents individual problems from committing too early to myopic solutions without ‘discussion’ with other problems. We are currently working towards the integration of early vision (completing the above figure). Check the page of my research associate Abhijit Ogale for results, demos, papers and code.
(b) A LANGUAGE FOR ACTION
We have been working for some time on recognizing human action from video. This led us to the
realization a few years ago that human action is characterized by a language. In
fact, actions are represented in three spaces: the visual space, the motor space
and the natural language (i.e., linguistic) space. Therefore, we can imagine that actions
possess at least three languages: a visual language, a motor language, and a natural
language. The visual language allows us to see and understand actions, the motor language
allows us to produce actions, and the natural language allows us to talk about actions. We
are developing the phonology, morphology and syntax of these different languages as well
as their relationships.
For results and papers, visit The Grammars of Human Behavior.
Here is the abstract of a talk I gave recently at Yale CS/Engineering and at the Cognitive Science
& Human Development Dept. at Vanderbilt.
ARISTOTLE’S DREAM: A LANGUAGE FOR ACTION
The behaviorome project?
I present
a roadmap to a language for symbolic
manipulation of visual and motor information in a sensori-motor system
model. The visual information is processed in
a perception subsystem that translates a visual representation of
action into our visuo-motor language. We measured a large number of human actions in order to have empirical
data that could validate and support our embodied language for movement and
activity. The embodiment of the language serves as an interface between visual
perception and the motor subsystem responsible for action execution. The
visuo-motor language is defined using a linguistic approach. In phonology, we define basic atomic
segments that are used to compose human activity. Phonological rules are modeled as
a finite automaton. In morphology,
we study how visuo-motor phonemes are combined to form strings representing
human activity, and we use these strings to generate a higher-level morphological grammar. This
compact grammar suggests the existence of lexical units working as visuo-motor
subprograms. In syntax, we present a
model for visuo-motor sentence construction where the subject corresponds to
the active joints (noun) modified by a posture (adjective). A verbal phrase
involves the representation of the human activity (verb) and timing
coordination among different joints (adverb). In simple terms, human action,
whether visual or motoric, is characterized by a language that we can
empirically discover. Proper interfacing among these languages and a spoken
language (e.g., English) gives rise to the human conceptual system.
The theory
has a number of implications for Engineering (sensor networks and surveillance
systems), cognitive systems and AI (the grounding problem), and Linguistics (universal
grammar). In the second part of the talk I will describe the behaviorome project, in which, using
recent technology for measuring sensorimotor activity, we build a sensorimotor
corpus. I will discuss how, using this corpus and learning technology, we could
obtain (learn) the human conceptual system, i.e., ultimately solve the grounding problem.
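To make the linguistic machinery of the abstract a little more concrete, here is a toy sketch in Python of the pieces it mentions: atomic motor "phonemes" whose admissible orderings are checked by a small finite automaton (phonology), and a visuo-motor "sentence" built from active joints (noun), a posture (adjective), an activity (verb) and timing coordination (adverb) (syntax). The vocabulary, transition rules, and example below are invented purely for illustration; they are not the grammar actually induced from our data.

```python
# Toy illustration of the structure described in the abstract above.
# All symbols and rules are invented for illustration only.

# --- Phonology: a finite automaton over hypothetical motor "phonemes" ----------
# Transition table of a toy automaton; e.g., it happens to forbid "extend"
# immediately after "extend" (an arbitrary rule, purely for illustration).
TRANSITIONS = {
    ("start", "flex"): "moving",
    ("start", "hold"): "still",
    ("moving", "extend"): "moving_ext",
    ("moving", "hold"): "still",
    ("moving_ext", "flex"): "moving",
    ("moving_ext", "hold"): "still",
    ("still", "flex"): "moving",
}
ACCEPTING = {"still"}

def is_well_formed(phoneme_string):
    """Accept a sequence of motor phonemes iff the automaton ends in an accepting state."""
    state = "start"
    for phoneme in phoneme_string:
        state = TRANSITIONS.get((state, phoneme))
        if state is None:
            return False
    return state in ACCEPTING

# --- Syntax: a visuo-motor "sentence" -------------------------------------------
from dataclasses import dataclass

@dataclass
class Sentence:
    joints: str    # noun: the active joints
    posture: str   # adjective: the posture modifying them
    activity: str  # verb: the human activity
    timing: str    # adverb: timing coordination among the joints

# Example "sentence": crouched knees and hips extend, in phase and fast.
jump = Sentence(joints="knees+hips", posture="crouched",
                activity="extend", timing="in-phase, fast")

print(is_well_formed(["flex", "extend", "hold"]))   # True under this toy automaton
print(jump)
```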