V-TOYS Interactive Toys

V-TOYS: Visually Interactive Toys

This research is a collaboration between Yaser Yacoob (U. of Maryland-College Park) and Ismail Haritaoglu, Alex Cozzi, David Koons, and Myron Flickner (IBM-Almaden) in the ongoing BlueEyes research effort.

Click here for a two-page summary of VTOY performance as of August 2000

Shown at SIGGRAPH 2000-Emerging Technologies

V-TOY is a visually interactive robot that understands and reacts to human presence and visual communication messages. V-TOY functionally replicates the upper part of the human body. It has a rotating neck, two controllable eyes, a camera, and controllable, deformable eyebrows and mouth.

V-TOY detects when a person walks sufficiently close and then greets the person verbally and visually by displaying a smile. V-TOY then introduces and demonstrates its capabilities and invites the user to test how well it can understand and mimic the user's facial expressions.
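The interaction sequence described above (wait for a person, greet, demonstrate, then mimic) can be pictured as a small state machine. The following is only an illustrative sketch: the state names, the `next_state` function, and the `NEAR_METERS` threshold are assumptions made for this example and are not taken from the V-TOY software.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()           # nobody is close enough
    GREETING = auto()       # verbal greeting plus a smile
    DEMONSTRATING = auto()  # robot introduces and shows its capabilities
    MIMICKING = auto()      # robot copies the user's facial expressions


# Hypothetical distance below which a person counts as "sufficiently close".
NEAR_METERS = 1.5


def next_state(state, person_distance):
    """Advance one step; person_distance is None when nobody is detected."""
    present = person_distance is not None and person_distance <= NEAR_METERS
    if not present:
        return State.IDLE            # person left: reset the interaction
    if state is State.IDLE:
        return State.GREETING
    if state is State.GREETING:
        return State.DEMONSTRATING
    return State.MIMICKING           # keep mimicking while the person stays
```

Calling `next_state` once per detection update walks the robot through the sequence, and any update without a nearby person returns it to `IDLE`.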

Eye contact, facial expressions, and actions play an important role in human communication. In addition to conveying emotions, they are used to augment and control the flow of human interactions. Toys are currently visually blind: they cannot recognize the presence of humans, identify their communicative messages, or react to these messages. Animators have long recognized the value of facial expressions in making actors appear more realistic. Moreover, they often employ exaggerated, cartoon-like animations to increase the intensity of the viewer's involvement.

The V-TOY robot, PONG, has a total of 12 servos resulting in 11 1/2 degrees of freedom. The eyes, each constructed from a ping-pong ball, have two degrees of rotational freedom each: azimuth and elevation. This enables PONG to establish eye contact with people in the camera's field of view and to track them as they move. The eyebrows each have one degree of freedom, corresponding to the corrugator supercilii and medial frontalis muscles, which allows PONG to raise and lower his eyebrows. The mouth is controlled by 4 servos: the first two move the left and right mouth corners up and down, and the second two move the middle of the upper and lower lips up and down, so that the mouth can be opened and closed. Finally, the neck has two servos controlling the pan and tilt position of the head. PONG is interfaced through an RS232 port driven from a Java application that allows control of all 12 servos. The video camera is embedded in the position of the nose.

V-TOY is a prototype of a future toy having the following capabilities:

  •  Detection of the presence of humans, their gaze, and their facial actions.
  •  Controlled neck and eye movements and generation of facial deformations such as mouth and eyebrow deformations.
  •  Speaker localization and speech recognition for verbal commands.
  •  A small set of behaviors designed to attract the attention of a human.
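
The servo layout described for PONG (4 eye, 2 eyebrow, 4 mouth, and 2 neck servos) can be written down as a simple table, which also makes it easy to describe a facial pose such as a smile. This is a minimal sketch under stated assumptions: the channel numbering, the normalized [-1, 1] position range, and the names `Servo`, `SERVOS`, `build_pose`, and `SMILE` are hypothetical, and the actual RS232 command format used by the Java application is not documented here, so no serial protocol is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Servo:
    channel: int  # hypothetical channel index, 0..11
    part: str     # body part the servo drives
    motion: str   # what the servo moves


# Layout follows the description: 4 eye, 2 eyebrow, 4 mouth, 2 neck servos.
SERVOS = [
    Servo(0, "left eye", "azimuth"),
    Servo(1, "left eye", "elevation"),
    Servo(2, "right eye", "azimuth"),
    Servo(3, "right eye", "elevation"),
    Servo(4, "left eyebrow", "raise"),
    Servo(5, "right eyebrow", "raise"),
    Servo(6, "mouth", "left corner"),
    Servo(7, "mouth", "right corner"),
    Servo(8, "mouth", "upper lip"),
    Servo(9, "mouth", "lower lip"),
    Servo(10, "neck", "pan"),
    Servo(11, "neck", "tilt"),
]


def build_pose(targets):
    """Return one normalized position per servo; 0.0 is the neutral position.

    targets maps (part, motion) pairs to values, which are clamped to [-1, 1].
    """
    by_name = {(s.part, s.motion): s.channel for s in SERVOS}
    pose = [0.0] * len(SERVOS)
    for key, value in targets.items():
        pose[by_name[key]] = max(-1.0, min(1.0, value))
    return pose


# A smile raises both mouth corners and leaves every other servo neutral.
SMILE = build_pose({("mouth", "left corner"): 0.8,
                    ("mouth", "right corner"): 0.8})
```

Keeping poses as plain lists indexed by channel means a driver layer, whatever its wire format, only needs to send 12 numbers per update.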


Face expression recognition

We developed basic techniques for motion analysis of human facial deformations. We built a system that simultaneously tracks the image motion of a human face and measures the nonrigid motion of face regions such as the eyes, eyebrows, and mouth. The system was applied initially to the problem of recognizing the so-called six universal facial expressions, originally identified by Darwin and more recently investigated systematically by the psychologists Ekman and Friesen.
The approach was applied to an extensive database of both laboratory image sequences and "real" video clips digitized from movies and television talk shows. The rules employed for recognizing facial expressions were drawn from the psychology literature.

Gaze recognition and face expression generation

We developed a fast, robust, and low-cost pupil detection technique that uses two time-multiplexed infrared (IR) light sources synchronized with the camera frame rate. One light source is placed very close to the camera's optical axis, and the second is placed off-axis. The pupil appears bright in the camera image during on-axis illumination (similar to the red-eye effect in flash photography) and dark during off-axis illumination. Our experiments with a real-time implementation of the system show that this technique is very robust and able to detect pupils using wide-field-of-view, low-cost cameras under different illumination conditions, even for people wearing glasses. We combine this pupil detector with scene background models to enhance robustness. Additionally, we use motion information to detect when new people appear in the scene, allowing attention to shift between people.
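
Because the pupil is bright under on-axis illumination and dark under off-axis illumination, while the rest of the scene looks similar in both frames, pupils stand out in the difference of two consecutive frames. The sketch below illustrates that idea only; it is not the implementation described above. Thresholding the difference image and grouping pixels with a 4-connected flood fill are assumptions chosen for the example, as are the function name `pupil_candidates` and the default `threshold`. The background models and motion cues mentioned in the text are omitted.

```python
def pupil_candidates(on_axis, off_axis, threshold=60):
    """Return (row, col) centroids of blobs that are bright only on-axis.

    on_axis and off_axis are equally sized grayscale frames given as
    lists of rows of integer intensities.
    """
    rows, cols = len(on_axis), len(on_axis[0])
    # Pixels much brighter under on-axis light are pupil candidates.
    mask = [[on_axis[r][c] - off_axis[r][c] > threshold for c in range(cols)]
            for r in range(rows)]
    seen = set()
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Flood fill one 4-connected blob starting at (r, c).
            stack = [(r, c)]
            seen.add((r, c))
            pixels = []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cy, cx))
    return centroids


# Synthetic 6x8 frames: uniform background, two "pupils" lit only on-axis.
off_frame = [[50] * 8 for _ in range(6)]
on_frame = [row[:] for row in off_frame]
on_frame[2][1] = on_frame[2][2] = 200   # first pupil, two pixels wide
on_frame[2][6] = 200                    # second pupil, one pixel
candidates = pupil_candidates(on_frame, off_frame)
```

A real system would additionally filter blobs by size and shape and pair them into eyes; those steps are left out to keep the sketch short.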