Estimating disparity with confidence from energy neurons
Eric K. C. Tsang Dept. of Electronic and Computer Engr. Hong Kong Univ. of Sci. and Tech. Kowloon, HONG KONG SAR eeeric@ee.ust.hk Bertram E. Shi Dept. of Electronic and Computer Engr. Hong Kong Univ. of Sci. and Tech. Kowloon, HONG KONG SAR eebert@ee.ust.hk

Abstract
The peak location in a population of phase-tuned neurons has been shown to be a more reliable estimator for disparity than the peak location in a population of position-tuned neurons. Unfortunately, the disparity range covered by a phase-tuned population is limited by phase wraparound. Thus, a single population cannot cover the large range of disparities encountered in natural scenes unless the scale of the receptive fields is chosen to be very large, which results in very low resolution depth estimates. Here we describe a biologically plausible measure of the confidence that the stimulus disparity is inside the range covered by a population of phase-tuned neurons. Based upon this confidence measure, we propose an algorithm for disparity estimation that uses many populations of high-resolution phase-tuned neurons that are biased to different disparity ranges via position shifts between the left and right eye receptive fields. The population with the highest confidence is used to estimate the stimulus disparity. We show that this algorithm outperforms a previously proposed coarse-to-fine algorithm for disparity estimation, which uses disparity estimates from coarse scales to select the populations used at finer scales.

1 Introduction
Binocular disparity, the displacement between the image locations of an object in the two eyes or cameras, is an important depth cue. Mammalian brains appear to represent the stimulus disparity using populations of disparity-tuned neurons in the visual cortex [1][2]. The binocular energy model is a first-order model that explains the responses of individual disparity-tuned neurons [3]. In this model, the preferred disparity of a neuron is determined by the phase and position shifts between the left and right monocular receptive fields (RFs).

Peak picking is a common disparity estimation strategy for these neurons [4]-[6]. In this strategy, the disparity estimate is the preferred disparity of the neuron with the largest response in the population. Chen and Qian [4] have suggested that the peak location in a population of phase-tuned disparity energy neurons is a more reliable estimate than the peak location in a population of position-tuned neurons.

It is difficult to estimate disparity from a single phase-tuned neuron population because its range of preferred disparities is limited. Figure 1 shows the population response of phase-tuned neurons (vertical cross section) for different stimulus disparities. If the stimulus disparity is confined to the range of preferred disparities of this population, the peak location changes linearly with the stimulus disparity. Thus, we can estimate the disparity from the peak. However, under natural viewing conditions, the range of stimulus disparities is more than ten times larger than the range of preferred disparities of the population [7]. The peak location then no longer indicates the stimulus disparity, since peaks still occur when the stimulus disparity is outside the range of the neurons' preferred disparities. These false peaks arise from two sources: phase wrap-around due to the sinusoidal modulation in the Gabor function modelling the neuron's RF, and unmatched edges entering the neuron's RF [5].

[Figure 1: population response image. Vertical axis: preferred disparity $D_{pref}$; horizontal axis: stimulus disparity (pixels), from -40 to 40.]

Fig. 1: Sample population responses of the phase-tuned disparity neurons for different disparities. This was generated by presenting the left image in Figure 5a to both eyes but varying the disparity by keeping the left image fixed and shifting the right image. At each point, the image intensity represents the response of a disparity neuron tuned to a fixed preferred disparity (vertical axis) in response to a fixed stimulus disparity (horizontal axis). The dashed vertical lines indicate the stimulus disparities that fall within the range of preferred disparities of the population (±8 pixels).

Although a single population can cover a large disparity range, the required receptive fields are very large, which results in very low resolution depth estimates. To address this problem, Chen and Qian [4] proposed a coarse-to-fine algorithm which refines the estimates computed from the coarse scales. Here we present an alternative way to estimate the stimulus disparity, based on a biologically plausible confidence measure for the population of phase-tuned neurons. This measure indicates whether the stimulus disparity lies inside or outside the range of their preferred disparities. We motivate this measure by examining the empirical statistics of the model neuron responses on natural images. Finally, we demonstrate the efficacy of using this measure to estimate the stimulus disparity. Our model generates better estimates than the coarse-to-fine approach [4], and it can detect occlusions.

2 Features of the phase-tuned disparity population
In this section, we characterize different features of a population of phase-tuned neurons. These features will be used in the analysis of the confidence measure. Figure 2a illustrates the binocular disparity energy model of a phase-tuned neuron [3]. For simplicity, we assume 1D processing, which is equivalent to considering one orientation in the 2D case. The response of a binocular simple cell is modelled by summing the outputs of linear monocular Gabor filters applied to the left and right images, followed by a positive or negative half-squaring nonlinearity. The response of a binocular complex cell is the sum of the four simple cell responses.

Formally, we denote the left and right retinal images by $U_l(x)$ and $U_r(x)$, where $x$ denotes the distance from the RF center. The disparity $d$ is the difference between the locations of corresponding points in the left and right images, i.e., an object that appears at point $x + d$ in the left image appears at point $x$ in the right image. Pairs of monocular responses are generated by integrating image intensities weighted by pairs of phase quadrature RF profiles, which are the real and imaginary parts of a complex-valued Gabor function ($j = \sqrt{-1}$):

$$h(x, \psi) = g(x)\, e^{j(\Omega x + \psi)} = g(x)\cos(\Omega x + \psi) + j\, g(x)\sin(\Omega x + \psi) \tag{1}$$

where $\Omega$ and $\psi$ are the spatial frequency and the phase of the left and right monocular RFs, and $g(x)$ is a zero-mean Gaussian with standard deviation $\sigma$, which is inversely proportional to the spatial frequency bandwidth. The spatial frequency and the standard deviation of the left and right RFs are identical, but the phases may differ ($\psi_l$ and $\psi_r$). We can compactly express the pairs of left and right monocular responses as the real and imaginary parts of $V_l(\psi_l) = V_l e^{j\psi_l}$ and $V_r(\psi_r) = V_r e^{j\psi_r}$, where, with a slight abuse of notation, we define

$$V_l = \int g(x)\, e^{j\Omega x}\, U_l(x)\, dx \quad \text{and} \quad V_r = \int g(x)\, e^{j\Omega x}\, U_r(x)\, dx \tag{2}$$
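As an illustration of eqs. (1) and (2), the following is a minimal discrete sketch, not the authors' implementation. The function names `gabor_kernel` and `monocular_response`, the truncation of the Gaussian at 4 standard deviations, the unit-sum normalization of the envelope, and the random test row are all assumptions made for this example; the values of the spatial frequency and standard deviation are the ones quoted later in Section 3.

```python
import numpy as np

def gabor_kernel(omega, sigma, psi=0.0):
    """Sampled complex Gabor h(x, psi) = g(x) * exp(j*(omega*x + psi)), eq. (1)."""
    half = int(np.ceil(4 * sigma))             # truncate at 4 sigma (assumption)
    x = np.arange(-half, half + 1)             # integer pixel offsets from the RF center
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()                               # unit-sum envelope (assumption)
    return x, g * np.exp(1j * (omega * x + psi))

def monocular_response(row, center, omega, sigma, psi=0.0):
    """Discrete form of eq. (2): V(psi) = sum_x h(x, psi) * U(center + x)."""
    x, h = gabor_kernel(omega, sigma, psi)
    return np.sum(h * row[center + x])

omega = 2 * np.pi / 16        # radians per pixel (value from Section 3)
sigma = 6.78                  # pixels (value from Section 3)
rng = np.random.default_rng(0)
row = rng.standard_normal(256)    # synthetic 1D image row, not real data
center = 128

V = monocular_response(row, center, omega, sigma)                  # psi = 0
V_psi = monocular_response(row, center, omega, sigma, np.pi / 2)   # psi = pi/2

# The real and imaginary parts of V are the outputs of the cosine-phase
# and sine-phase RF profiles, i.e. a phase quadrature pair.
x, h = gabor_kernel(omega, sigma)
even = np.sum(h.real * row[center + x])
odd = np.sum(h.imag * row[center + x])
print(V, even, odd)
```

Changing the RF phase only multiplies the response by a unit complex number, which is why the text can write $V(\psi) = V e^{j\psi}$.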

[Figure 2: (a) block diagram of the model, with monocular filters $h(x, \psi_l)$ and $h(x, \psi_r)$ applied to $U_l(x)$ and $U_r(x)$, half-squaring binocular simple cells, and the binocular complex cell; (b) population response $E_d(\Delta\psi)$ annotated with $S$, $P$ and $\Delta\Phi$.]

Fig. 2: (a) Binocular disparity energy model of a disparity neuron in the phase-shift mechanism. The phase-shift $\psi_r - \psi_l$ between the left and right monocular RFs determines the preferred disparity of the neuron. The neuron shown is tuned to a negative disparity of $-\pi/(2\Omega)$. (b) The population response of the phase-tuned neurons $E_d(\Delta\psi)$ centered at a retinal location with the phase-shifts $\Delta\psi \in [-\pi, \pi]$ can be characterized by three features $S$, $P$ and $\Delta\Phi$.

The response of the binocular complex cell (the disparity energy) is the squared modulus of the sum of the monocular responses:

$$E_d(\Delta\psi) = \left| V_l e^{j\psi_l} + V_r e^{j\psi_r} \right|^2 = |V_l|^2 + V_l V_r^* e^{-j\Delta\psi} + V_l^* V_r e^{j\Delta\psi} + |V_r|^2 \tag{3}$$

where the * superscript indicates complex conjugation. The phase-shift between the right and left RFs, $\Delta\psi = \psi_r - \psi_l$, controls the preferred disparity $D_{pref}(\Delta\psi) \approx -\Delta\psi/\Omega$ of the binocular complex cell [6]. If we fix the stimulus and allow $\Delta\psi$ to vary between $\pm\pi$, the function $E_d(\Delta\psi)$ in (3) describes the population response of the phase-tuned neurons whose preferred disparities range between $-\pi/\Omega$ and $\pi/\Omega$. The population response can be completely specified by three features $S$, $P$ and $\Delta\Phi$ [4][5]:

$$E_d(\Delta\psi) = S + P\cos(\Delta\Phi - \Delta\psi) \tag{4}$$

where

$$S = |V_l|^2 + |V_r|^2, \qquad P = 2|V_l||V_r| = 2|V_l V_r^*|, \qquad \Delta\Phi = \Phi_l - \Phi_r = \arg(V_l V_r^*) \tag{5}$$

Figure 2b shows the graphical interpretation of these features. The feature $S$ is the average response across the population. The feature $P$ is the difference between the peak and average responses. Note that $S \ge P$, since $S - P = (|V_l| - |V_r|)^2 \ge 0$. The feature $\Delta\Phi$ is the peak location of the population response. Peak picking algorithms compute the estimates from the peak location, i.e. $d_{est} = -\Delta\Phi/\Omega$ [6].
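The relationship between eq. (3) and the three features can be checked numerically. The sketch below is illustrative only: the function names and the two example complex responses are arbitrary choices made here, not values from the paper. It evaluates the disparity energy on a grid of phase-shifts and compares it with $S + P\cos(\Delta\Phi - \Delta\psi)$.

```python
import numpy as np

def population_features(V_l, V_r):
    """Features of eq. (5): average S, modulation depth P, peak location dPhi."""
    S = abs(V_l) ** 2 + abs(V_r) ** 2
    P = 2 * abs(V_l) * abs(V_r)
    dPhi = np.angle(V_l * np.conj(V_r))
    return S, P, dPhi

def disparity_energy(V_l, V_r, dpsi):
    """Eq. (3) with psi_l = 0 and psi_r = dpsi; only the difference matters."""
    return np.abs(V_l + V_r * np.exp(1j * dpsi)) ** 2

omega = 2 * np.pi / 16            # radians per pixel (value from Section 3)
V_l = 1.0 + 0.5j                  # arbitrary example responses, not data
V_r = 0.8 * np.exp(-0.6j)

dpsi = np.linspace(-np.pi, np.pi, 361)    # phase-shifts of the population
E = disparity_energy(V_l, V_r, dpsi)
S, P, dPhi = population_features(V_l, V_r)

E_model = S + P * np.cos(dPhi - dpsi)     # eq. (4)
peak = dpsi[np.argmax(E)]                 # peak picking on the sampled population
R = P / S                                 # normalized feature used in Section 3
d_est = -dPhi / omega                     # peak-picking disparity estimate
print(np.max(np.abs(E - E_model)), peak, dPhi, R, d_est)
```

Because the population response is a raised cosine, the peak location can be read directly from $\arg(V_l V_r^*)$ without sampling the population densely.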

3 Feature Analysis
In this section, we suggest a simple feature to differentiate between two classes of stimulus disparities, DIN and DOUT, corresponding to stimulus disparities inside ($|d| \le \pi/\Omega$) and outside ($|d| > \pi/\Omega$) the range of preferred disparities in the population. We consider the joint distribution of $S$ and $P$ in our analysis. Intuitively, the peak location $\Delta\Phi$ is less effective in distinguishing between in-range and out-of-range disparities, since Figure 1 shows that the true and false peaks fall into the same range of $\pm\pi$. For illustration purposes, we consider an equivalent set of features: $S$ and $R = P/S$. The feature $R$ is bounded between 0 and 1, since $S \ge P$.


Because of the variability of natural scenes, the features $S$ and $R$ are random variables. In making a decision based on random features, Bayesian classifiers minimize the classification error. In particular, decisions are made by comparing how strongly the features favor the true hypothesis (DIN) over the reference hypothesis (DOUT), through the Bayes factor:

$$B_{S,R} = \frac{f_{S,R|C}(s, r \mid \mathrm{DIN})}{f_{S,R|C}(s, r \mid \mathrm{DOUT})} \;\underset{\mathrm{DOUT}}{\overset{\mathrm{DIN}}{\gtrless}}\; T_{S,R} \tag{6}$$

where the threshold $T_{S,R}$ controls the decision boundary on the feature space $\{S, R\}$ and depends upon the prior class probabilities $P(\mathrm{DIN})$ and $P(\mathrm{DOUT})$. The function $f_{S,R|C}(s, r \mid c)$ is the conditional density of the features given the class $c \in \{\mathrm{DIN}, \mathrm{DOUT}\}$.

To find the optimal decision boundary for the features $S$ and $R$, we estimated the joint class likelihood $f_{S,R|C}(s, r \mid c)$ by binning the features $S$ and $R$ from the empirical statistics. The statistics of the features were computed using the "Cones" and the "Teddy" stereograms from Middlebury College [8][9], shown in Figure 5a. The stereograms are rectified, so that corresponding points lie on the same horizontal scan-lines of the left and right images. Each image has 1500 x 1800 pixels. We constructed a population of phase-tuned neurons at each pixel. The disparity neurons had the same spatial frequency and standard deviation, and were selective to vertical orientations. The spatial frequency was $\Omega = 2\pi/16$ radians per pixel and the standard deviation in the horizontal direction was $\sigma = 6.78$ pixels, corresponding to a spatial bandwidth of 1.8 octaves. The standard deviation in the vertical direction was $2\sigma$. The range of preferred disparities (DIN) of the population is therefore $\pm\pi/\Omega = \pm 8$ pixels. To reduce the variability in the classification, we also applied a small Gaussian spatial pooling with standard deviation $0.5\sigma$ to the population [4][5]. The features $S$ and $R$ computed from the population were separated into two classes (DIN and DOUT) according to the ground truth in Figure 5b.
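The binning procedure and the Bayes factor of eq. (6) can be sketched as follows. This is an illustrative example under stated assumptions: the samples are drawn from placeholder gamma and beta distributions rather than from the Middlebury stereograms, the function names are hypothetical, and the small constant `eps` that guards empty bins is a choice made here. The bin sizes (0.25 for $S$, 0.01 for $R$) follow the values quoted for Figure 3.

```python
import numpy as np

def joint_density(S, R, s_edges, r_edges):
    """Empirical joint density of (S, R) from a 2D histogram (integrates to 1)."""
    hist, _, _ = np.histogram2d(S, R, bins=[s_edges, r_edges], density=True)
    return hist

def bayes_factor(f_in, f_out, eps=1e-12):
    """Eq. (6): ratio of class-conditional densities; eps guards empty bins."""
    return (f_in + eps) / (f_out + eps)

def classify(s, r, B, s_edges, r_edges, T=1.0):
    """Label samples DIN (True) where the Bayes factor of their bin exceeds T."""
    i = np.clip(np.digitize(s, s_edges) - 1, 0, len(s_edges) - 2)
    j = np.clip(np.digitize(r, r_edges) - 1, 0, len(r_edges) - 2)
    return B[i, j] > T

rng = np.random.default_rng(1)
n = 20000
# Placeholder samples (NOT the paper's statistics): DIN has larger R, smaller S.
S_in, R_in = rng.gamma(2.0, 1.5, n), rng.beta(5, 2, n)
S_out, R_out = rng.gamma(2.0, 2.5, n), rng.beta(2, 5, n)

s_edges = np.arange(0, 20.25, 0.25)     # bin size 0.25 for S
r_edges = np.linspace(0, 1, 101)        # bin size 0.01 for R
f_in = joint_density(S_in, R_in, s_edges, r_edges)
f_out = joint_density(S_out, R_out, s_edges, r_edges)
B = bayes_factor(f_in, f_out)

# Fraction of each class that the rule B > 1 labels correctly (in-sample).
acc_in = classify(S_in, R_in, B, s_edges, r_edges).mean()
acc_out = 1.0 - classify(S_out, R_out, B, s_edges, r_edges).mean()
print(acc_in, acc_out)
```

Raising or lowering the threshold `T` moves the decision boundary in the $(S, R)$ plane, which corresponds to changing the prior class probabilities.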

[Figure 3: four panels. (a), (b): joint density over $S$ (horizontal axis, up to 20) and $R$ (vertical axis, 0 to 1). (c): decision boundaries in the $(S, R)$ plane. (d): difference in $P_e$ (scale $\times 10^{-3}$) versus $P(\mathrm{DIN})$ from 0 to 0.5.]

Fig. 3: The empirical joint density of $S$ and $R$ given (a) DIN and (b) DOUT. (c) The optimal decision boundaries derived from the Bayes factor. (d) The change in total probability of error $P_e$ between using a flat boundary (thresholding $R$) versus the optimal boundary.

Figure 3a-b show the empirical joint likelihoods for the two classes. They were computed by binning the features $S$ and $R$ with bin sizes of 0.01 for $R$ and 0.25 for $S$. Given a disparity within the range of preferred disparities (DIN), the joint density concentrates at small $S$ and large $R$. For out-of-range disparities (DOUT), the joint density shifts towards larger $S$ and smaller $R$. Intuitively, a horizontal hyperplane, illustrated by the red dotted line in Figure 3a-b, is an appropriate decision boundary to separate the DIN and DOUT data. This indicates that the feature $R$ can be used to distinguish between in-range and out-of-range disparities.

Mathematically, we can compute the optimal decision boundaries by applying different thresholds to the Bayes factor in (6). Figure 3c shows the boundaries. They are essentially flat, except that they bend downward for small $S$.

We also evaluate how well simple thresholding of $R$ performs compared with the optimal decision boundaries in distinguishing between in-range and out-of-range disparities. Given the prior class probabilities, we compute a hyperplane $R - c = 0$ for $c \in [0, 1]$ that minimizes the total probability of classification error:


$$P_e = P(\mathrm{DIN}) \iint_{(R - c) < 0} f_{S,R|C}(s, r \mid \mathrm{DIN})\, ds\, dr + P(\mathrm{DOUT}) \iint_{(R - c) > 0} f_{S,R|C}(s, r \mid \mathrm{DOUT})\, ds\, dr \tag{7}$$

We then compare this total probability of error with the one computed using the optimal decision boundaries derived from (6). Figure 3d shows the deviation in the total probability of error between the two approaches for different priors of DIN. The deviation is small (on the order of $10^{-2}$), suggesting that thresholding $R$ results in similar performance to using the optimal decision boundaries. Thus, $R$ appears to be a suitable confidence measure for distinguishing DIN from DOUT. Moreover, this measure can be explained by normalization, a common model for V1 neurons [10].
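The search for the flat boundary $R - c = 0$ in eq. (7) can be sketched as a one-dimensional scan over candidate thresholds. The example below is a simplified illustration: it estimates the two error terms from sample fractions rather than by integrating a binned density, the samples are placeholder beta-distributed values rather than the paper's data, and the function names and the grid of 101 candidates are assumptions.

```python
import numpy as np

def error_probability(R_in, R_out, c, p_in):
    """Eq. (7) with sample fractions: DIN samples below c and DOUT samples above c are errors."""
    miss = np.mean(R_in < c)              # DIN wrongly labelled DOUT
    false_alarm = np.mean(R_out > c)      # DOUT wrongly labelled DIN
    return p_in * miss + (1.0 - p_in) * false_alarm

def best_threshold(R_in, R_out, p_in, candidates=None):
    """Scan thresholds c in [0, 1] and return the one with the smallest P_e."""
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)
    errors = np.array([error_probability(R_in, R_out, c, p_in) for c in candidates])
    k = int(np.argmin(errors))
    return candidates[k], errors[k]

rng = np.random.default_rng(2)
R_in = rng.beta(5, 2, 5000)     # placeholder DIN samples (large R)
R_out = rng.beta(2, 5, 5000)    # placeholder DOUT samples (small R)

c_star, pe_star = best_threshold(R_in, R_out, p_in=0.5)
pe_accept_all = error_probability(R_in, R_out, 0.0, 0.5)   # c = 0 labels everything DIN
print(c_star, pe_star, pe_accept_all)
```

Repeating the scan for several values of `p_in` gives the flat-boundary error curve that Figure 3d compares against the optimal boundary.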

4 Hybrid position-phase model for disparity estimation with validation
[Figure 4: block diagram. The images $U_l(x)$ and $U_r(x)$ feed phase-tuned populations $E_d(\Delta\psi)$ at position-shifts $\Delta c$ from $-128$ to $128$; each population outputs $R_{\Delta c}$ and $\Delta\Phi_{\Delta c}$; a winner-take-all stage selects $R_{\Delta c^*}$ and $\Delta\Phi_{\Delta c^*}$, which give $d_{est}$ and, through the comparison $R > T_R$, the DIN/DOUT label.]

Fig. 4: Proposed disparity estimator with the validation of disparity estimates.

Our analysis above shows that $R$ is a simple indicator to distinguish between in-range and out-of-range disparities. In this section, we describe a model that uses this feature to estimate the stimulus disparity with validation. Figure 4 shows the proposed model, which consists of populations of hybrid-tuned disparity neurons tuned to different phase-shifts $\Delta\psi$ and position-shifts $\Delta c$. For each population tuned to the same position-shift but different phase-shifts (phase-tuned population), we compute the feature $R_{\Delta c}$ by normalizing the difference between the peak response ($S + P$) and the average response ($S$) by $S$. The average activation $S$ can be computed by pooling the responses of all neurons in the phase-tuned population. The features $R_{\Delta c}$ at different position-shifts are compared through a winner-take-all network to select the position-shift $\Delta c^*$ with the maximum $R_{\Delta c}$. The disparity estimate is further refined by the peak location $\Delta\Phi_{\Delta c^*}$:

$$d_{est} = \Delta c^* - \frac{\Delta\Phi_{\Delta c^*}}{\Omega} \tag{8}$$

In addition to estimating the stimulus disparity, we also validate the estimates by comparing $R_{\Delta c^*}$ with a threshold $T_R$. Instead of choosing a fixed threshold, we vary the threshold to show that the feature $R_{\Delta c}$ can serve as an occlusion detector.

4.1 Disparity estimation with confidence

We applied the proposed model to estimate the disparity of the "Cones" and the "Teddy" stereograms, shown in Figure 5a. The spatial frequency and the spatial standard deviation of the neurons were kept the same as in the previous analysis. We also performed spatial pooling and orientation pooling to improve the estimation. For spatial pooling, we applied a circularly symmetric Gaussian function with the same standard deviation $\sigma$. For orientation pooling, we pooled the responses over five orientations ranging from 30 to 150 degrees. The range of the position-shifts for the populations was set to the largest disparity range, ±128 pixels, according to the ground truth.

[Figure 5: image panels (a)-(e) for the Cones and Teddy stereograms (left and right images, estimate maps and error maps); scale bar from -100 to 100.]

Fig. 5: (a) The two natural stereograms used to evaluate the model performance. (b) The ground truth disparity maps with respect to the left images, obtained by the structured light method. (c) The ground truth occlusion maps. (d) The disparity maps and the error maps computed by the coarse-to-fine approach. (e) The disparity maps and the error maps computed by the proposed model. The detected invalid estimates are labelled in black in the disparity maps.

We also implemented the coarse-to-fine model as described in [4] for comparison. In this model, an initial disparity estimate computed from a population of phase-tuned neurons at the coarsest scale is successively refined by the populations of phase-tuned neurons at the finer scales. By choosing the coarsest scale large enough, the disparity range covered by this method can be arbitrarily large. The coarsest and the finest scales had Gabor periods of 512 and 16 pixels. The Gabor periods of successive scales differed by a factor of 2. Neurons at the finest scale had the same RF parameters as in our model. The same spatial pooling and orientation pooling were applied at each scale.

Figure 5d-e show the estimated disparity maps and the error maps of the two approaches. The error maps show the regions where the error in the disparity estimate exceeds 1 pixel. Both models correctly recover the stimulus disparity at most locations with gradual disparity changes, but tend to make errors at the depth boundaries. However, the proposed model generates more accurate estimates: the coarse-to-fine model estimates 36.3% of the pixels incorrectly, compared with only 27.8% for the proposed model.
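The estimator of Section 4 (winner-take-all over $R_{\Delta c}$ followed by the phase refinement of eq. (8)) can be sketched in 1D as follows. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions: the stimulus is a single synthetic white-noise row; the 2D spatial and orientation pooling of the paper is replaced by a 1D Gaussian pooling of the population response with standard deviation $4\sigma$; the position-shifts are spaced 8 pixels apart over a reduced range of ±64 pixels; the threshold 0.3 is the example value from Section 4.2; and the synthetic right image is the left image shifted rightward by `true_d` pixels, a sign convention chosen here so that eq. (8) returns the imposed shift. All function names are hypothetical.

```python
import numpy as np

def gabor_responses(row, omega, sigma):
    """V[c] = sum_x g(x) exp(j*omega*x) row[c + x] for every center c (zero padded)."""
    half = int(np.ceil(4 * sigma))
    x = np.arange(-half, half + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()
    even, odd = g * np.cos(omega * x), g * np.sin(omega * x)
    # Correlation with a kernel equals convolution with the flipped kernel.
    return (np.convolve(row, even[::-1], mode="same")
            + 1j * np.convolve(row, odd[::-1], mode="same"))

def pooled_features(V_l, V_r, center, shift, pool_sigma):
    """S, P, dPhi of the spatially pooled phase-tuned population at one position-shift."""
    half = int(np.ceil(3 * pool_sigma))
    x = np.arange(-half, half + 1)
    w = np.exp(-x**2 / (2.0 * pool_sigma**2))
    w /= w.sum()
    vl = V_l[center + x]
    vr = V_r[center + shift + x]           # right RF displaced by the position-shift
    S = np.sum(w * (np.abs(vl) ** 2 + np.abs(vr) ** 2))
    C = 2.0 * np.sum(w * vl * np.conj(vr))  # pooled cross term of eq. (3)
    return S, np.abs(C), np.angle(C)

def estimate_disparity(V_l, V_r, center, shifts, omega, pool_sigma, T_R):
    """Winner-take-all over R, then eq. (8); returns (d_est, R*, is_valid)."""
    best_R, best_dc, best_dPhi = -1.0, None, 0.0
    for dc in shifts:
        S, P, dPhi = pooled_features(V_l, V_r, center, dc, pool_sigma)
        R = P / S if S > 0 else 0.0
        if R > best_R:
            best_R, best_dc, best_dPhi = R, dc, dPhi
    d_est = best_dc - best_dPhi / omega
    return d_est, best_R, best_R > T_R

omega, sigma = 2 * np.pi / 16, 6.78
rng = np.random.default_rng(3)
left = rng.standard_normal(1024)           # synthetic row, not real data
true_d = 21
right = np.roll(left, true_d)              # right[i] = left[i - true_d]

V_l = gabor_responses(left, omega, sigma)
V_r = gabor_responses(right, omega, sigma)
shifts = range(-64, 65, 8)                 # hypothetical position-shift spacing
d_est, R_star, valid = estimate_disparity(V_l, V_r, 512, shifts, omega, 4 * sigma, 0.3)
print(d_est, R_star, valid)
```

Pooling matters in this sketch: without it, $R$ at a single location depends only on the ratio of the monocular response magnitudes, so it cannot tell matched from unmatched image patches.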
In the coarse-to-fine model, the estimates appear blurry around the depth boundaries and tend to be erroneous there. This arises because the model assumption that the stimulus disparity is constant over the RF of the neuron is unlikely to hold at very large scales. At boundaries, the coarse-to-fine model generates poor initial estimates, which cannot be corrected at the finer scales, because the actual stimulus disparities are outside the range considered at the finer scales.

On the other hand, the proposed model is not only able to estimate the stimulus disparity, but can also validate the estimates. In general, the responses of neurons selective to different position disparities are not comparable, since they depend upon image contrast, which varies across spatial locations. However, the feature $R$, which is computed by normalizing the response peak by the average response, eliminates this dependency. Moreover, the invalid regions detected (the black regions on the disparity maps) are in excellent agreement with the error labels.

4.2 Occlusion detection

In addition to validating the disparity estimates, the feature $R$ can also be used to detect occlusion. Occlusion is one of the challenging problems in stereo vision. Occlusion occurs near depth discontinuities where there is no correspondence between the left and right images. The disparity in the occlusion regions is undefined. The occlusion regions for these stereograms are shown in Figure 5c.

There are three possibilities for image pixels that are labelled as out of range (DOUT). They are occluded pixels, pixels with valid disparity that are incorrectly estimated, and pixels with valid disparity that are correctly estimated (misclassification). Figure 6a shows the percentages of the three possibilities in the DOUT regions for different decision thresholds $T_R$. Formally, these percentages are defined by

$$P_1\% = \frac{\#\ \text{of each of the three possibilities in the DOUT regions}}{\#\ \text{of DOUT labels}} \times 100\% \tag{9}$$

where these percentages sum to unity for any threshold $T_R$. For small thresholds, the detector mainly identifies the occlusion regions first and then regions with incorrect estimates. Figure 6b shows the percentage of each of the three possibilities in the entire image that is labelled as DOUT. Formally, these percentages are defined by

$$P_2\% = \frac{\#\ \text{of each of the three possibilities in the DOUT regions}}{\#\ \text{of that possibility in the image}} \times 100\% \tag{10}$$

For a large threshold ($T_R$ close to unity), all estimates are labelled as DOUT. Thus, all three percentages reach 100%. The proposed detector is effective in identifying occlusion. At the threshold $T_R = 0.3$, it identifies ~70% of occlusion, ~20% of estimation error and ~10% of misclassification.
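The two percentages of eqs. (9) and (10) can be computed from per-pixel labels as in the sketch below. It is illustrative only: the function name, the six-pixel toy arrays, and the handling of ties (pixels with $R$ equal to the threshold are labelled DOUT, following the test $R > T_R$ for DIN in Figure 4) are assumptions. The 1-pixel error tolerance is the one used for the error maps in Section 4.1.

```python
import numpy as np

def dout_breakdown(R_star, T_R, occluded, abs_error, tol=1.0):
    """Return (P1, P2) in percent for the three possibilities among DOUT pixels."""
    out = R_star <= T_R                               # pixels labelled DOUT
    wrong = ~occluded & (abs_error > tol)             # valid disparity, bad estimate
    correct = ~occluded & (abs_error <= tol)          # valid disparity, good estimate
    cats = {"occlusion": occluded,
            "estimation error": wrong,
            "misclassification": correct}
    # Eq. (9): share of each possibility among all DOUT labels.
    P1 = {k: 100.0 * np.sum(m & out) / max(np.sum(out), 1) for k, m in cats.items()}
    # Eq. (10): share of each possibility in the image that is labelled DOUT.
    P2 = {k: 100.0 * np.sum(m & out) / max(np.sum(m), 1) for k, m in cats.items()}
    return P1, P2

# Toy per-pixel values (hypothetical, for illustration only).
R_star = np.array([0.10, 0.20, 0.25, 0.60, 0.90, 0.28])
occluded = np.array([True, True, False, False, False, False])
abs_error = np.array([0.0, 0.0, 5.0, 0.2, 0.1, 0.3])   # unused where occluded

P1, P2 = dout_breakdown(R_star, 0.3, occluded, abs_error)
print(P1)
print(P2)
```

Sweeping `T_R` from 0 to 1 and plotting the entries of `P1` and `P2` gives curves of the kind shown in Figure 6.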

[Figure 6: two plots of percentage (0 to 1, x100%) versus threshold $T_R$ (0 to 1): (a) P1, (b) P2.]

Fig. 6: The percentages of the three possibilities for those image regions that are labelled as out of range (DOUT). They are due either to occlusion (thick lines), to true error in the estimates (thin lines), or to misclassifying correct estimates as DOUT (dotted lines). (a) Considering only the DOUT regions. (b) Considering the entire image.


5 Discussion
This work suggests a complementary functional role for normalization in neural responses, whose magnitudes can serve as a confidence measure for the disparity estimate implied by the population. Normalization has been encountered in models of V1 neurons [10]. Previous theories emphasize that the goal of normalization is to remove the dependence of the neural responses on image contrast. Even in models of disparity estimation using populations of neurons in V1, normalization is applied to the response strength so that neural responses tuned to different stimulus dimensions are comparable [11]. However, the normalization described here should not be interpreted as solely reducing dependency upon stimulus energy. Our experimental analysis shows that the normalized magnitude $R$ separates the population responses due to in-range and out-of-range stimulus disparities, and that it can detect occlusion effectively. The analysis also indicates that un-normalized responses (e.g. $S$ and $P$) do not have this property.

References
[1] H. B. Barlow, C. Blakemore, and J. D. Pettigrew. The neural mechanism of binocular depth discrimination. Journal of Physiology, vol. 193(2), 327-342, 1967.
[2] G. F. Poggio, B. C. Motter, S. Squatrito, and Y. Trotter. Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vision Research, vol. 25, 397-406, 1985.
[3] I. Ohzawa, G. C. DeAngelis, and R. D. Freeman. Stereoscopic depth discrimination in the visual cortex: neurons ideally suited as disparity detectors. Science, vol. 249, 1037-1041, 1990.
[4] Y. Chen and N. Qian. A Coarse-to-Fine Disparity Energy Model with Both Phase-Shift and Position-Shift Receptive Field Mechanisms. Neural Computation, vol. 16, 1545-1577, 2004.
[5] D. J. Fleet, H. Wagner, and D. J. Heeger. Neural encoding of binocular disparity: energy models, position shifts and phase shifts. Vision Research, vol. 36, 1839-1857, 1996.
[6] N. Qian and Y. Zhu. Physiological computation of binocular disparity. Vision Research, vol. 37, 1811-1827, 1997.
[7] S. J. D. Prince, B. G. Cumming, and A. J. Parker. Range and Mechanism of Encoding of Horizontal Disparity in Macaque V1. Journal of Neurophysiology, vol. 87, 209-221, 2002.
[8] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. IJCV, vol. 47(1/2/3), 7-42, 2002.
[9] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. IEEE CVPR, vol. 1, 195-202, 2003.
[10] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, vol. 9, 181-198, 1992.
[11] S. R. Lehky and T. J. Sejnowski. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity [published erratum appears in Journal of Neuroscience, vol. 11(3), 1991]. Journal of Neuroscience, vol. 10, 2281-2299, 1990.