Università degli Studi di Udine

Dipartimento di Matematica e informatica

Via delle Scienze, 206 - 33100 UDINE (ITALY), tel. +39 432 558400, fax +39 432 558499

UDMI/54/96/RR

On Stereo Fusion in Humans and Machines

V. Roberto, Y. Yeshurun, A. Fusiello, and E. Trucco

Abstract. Computational approaches to stereo fusion are reviewed from both the human and the machine vision perspectives. Evidence is summarised that points towards area-based mechanisms underlying depth perception in humans. On the artificial side, stereo matching techniques are reviewed, and experiments with area-based algorithms and random-dot stereograms are reported. Results and open problems in both domains are also discussed.

Internal Report; limited circulation only. Legal obligations have been fulfilled (D. L.Lgt. 31/8/45 no. 660).

On Stereo Fusion in Humans and Machines

V. Roberto, Y. Yeshurun (*), A. Fusiello, and E. Trucco (†)

Machine Vision Laboratory, Dept. of Informatics, University of Udine, I-33100 Udine, Italy

(*) Computer Science Dept., School of Mathematical Science, Tel Aviv University, Tel Aviv, Israel

(†) Department of Computing and Electrical Engineering, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom


1 Introduction

Binocular stereo fusion uses the slightly different views of a scene, projected onto the left and right retinas, to recover depth information. In general terms, in both humans and machines the problem reduces to matching the two views in order to find the displacement (disparity) between corresponding patterns in the projected images. Stereo fusion has been intensively investigated, and a number of computational models have been proposed, both in human vision (to explain psychophysical and neurophysiological data) and in machine vision (to solve problems such as reconstructing 3D shapes and understanding motion and moving shapes).

Historically, the computational approach has provided a unified framework for building models, simulating mechanisms, and quantitatively comparing measurements with predictions. In fact, a number of biologically plausible algorithms have been proposed in the past to address machine vision tasks. However, studies in human and machine stereo have rapidly progressed along distinct paths and with basically different purposes, so that biological plausibility is now a minor concern for researchers in machine vision. Nevertheless, the human visual system is still a source of extremely helpful ideas; conversely, researchers in human vision benefit from new computational structures and algorithms. For these reasons, it is still worthwhile reviewing some of the achievements in the two domains, exploiting the common language offered by the computational approach.

This paper provides a review of the subject, with emphasis on some recent results and open problems in both domains. In particular, Section 2 presents an introductory review of computational stereo. Some of the main observations concerning human stereovision are presented in Section 3, and the algorithms are discussed accordingly. A similar review is given in Section 4 for machine stereo, with emphasis on area-based techniques, their advantages and drawbacks. Section 5 contains our conclusions.

2 Computational Stereo: a View from the Bridge

We group together all the relevant solutions proposed to the stereo matching problem, derived from both the human and the machine vision literature. The techniques adopted can be classified along two dimensions: the kind of image element considered for matching (What to match), and the technique used to compute the matching (How to match). In addition, one may be interested in the computational scheme adopted, especially when biological plausibility is of concern.

Let us address the first issue. Some algorithms [1, 2] match individual pixels, i.e., the atomic elements of an image. More robust methods match the gray levels of image patches (windows), computing the disparity for every pixel [3, 4], for the centers of the windows [5, 6], or for selected points of interest [7]. These window-based methods are also called area-based. Although several correlation measures have been proposed, the Sum of Squared Differences (SSD) is regarded as a reasonable choice [8, 4, 3, 9, 10, 11, 12]. Some of the problems encountered when matching raw intensities, which arise because the gray levels are not identical in the two images, may be overcome by considering the output of a bandpass filter, usually a LoG (Laplacian of Gaussian) filter [13]. One could also compute the response of a bank of filters at a given image point, which defines a vector characterising the local structure of the image [14, 15]. A similar vector is estimated on the other image, and the two vectors are matched.

Matching image features is generally more robust; the related class of algorithms is called feature-based. In the present context, the term "features" indicates physically meaningful cues, such as edges [16, 17, 18, 19, 20], segments (collinear connected edges) [21], and corners (where two edges cross) [22]. Features can be extracted by bandpass filters, derivative operators or ad hoc nonlinear operators. The local phase of the image signal, computed via Fourier or Gabor transforms, has also been matched [23, 24, 25]. As the disparity should be less than one pixel to avoid aliasing (according to the sampling theorem, or the "quarter cycle limit" [1]), a multiresolution scheme should be employed.

We now come to the second question: how to perform matching? Correlation techniques consist in finding the amount of shift that yields the maximum similarity score between the left and the right elements [8, 4, 3, 9, 10, 11, 13]. With relaxation-based methods the elements are joined by weighted links; the initial weights are iteratively updated by propagating constraints, until some equilibrium configuration is reached [1, 16, 17, 18, 22]. Dynamic programming techniques adopt a cost function, which embeds the constraints and is minimised to obtain the best set of matches [20, 19, 10, 12, 26, 2]. The solution is a path in the match space [2, 12] or in the disparity space [10]. Usually, the cost function (energy functional) is derived using Bayesian reasoning [12, 26, 2]. A novel approach to matching consists in representing image scanlines by means of intrinsic curves [27], i.e., the paths followed by a descriptor vector as the scanline is traversed from left to right. Intrinsic curves are invariant to image displacements, and this property is exploited to compute the matching.

As far as the computational scheme is concerned, algorithms can be classified into cooperative, coarse-to-fine and feedforward (see [28] for more details). Cooperative models, pioneered by [1], exploit the properties of recurrent networks, which relax to a minimum-energy configuration. In coarse-to-fine models, the disparities computed at different spatial scales are fused to compute the final disparity estimate. In biological vision, coarse-to-fine models identify a special class of algorithms that use multiple spatial filters simulating receptive fields [16, 13]. In machine vision, this paradigm is applicable to every scheme, in order to obtain scale independence and data redundancy; it is mandatory only with phase-based methods. While the cooperative and the coarse-to-fine techniques require cooperative feedback or sequential disparity processing over the spatial scales, the feedforward scheme [5] operates in one shot, like most machine stereo algorithms.

For further details on machine stereo, the reader may consult the surveys in Refs. [29, 30]; a review of computational models of human stereo is given in [28].

3 Issues in Human Stereo Fusion

3.1 Observables and measures

Binocular stereopsis is one of the most intensively investigated cues, both algorithmically and psychophysically, among the many cues that humans use for segmenting a three-dimensional scene. Many of the proposed models are based on a direct search for matching features in the left and right images. A frequently examined parameter is stereoacuity (stereo depth acuity), determined by the disparity threshold (the smallest disparity that yields correct depth perception). Ogle [31, 32] examined Panum's fusional area, which can be expressed as the range of disparities within which a stereoscopically presented object appears fused and single. Another variable that has been examined is contrast sensitivity [33, 34, 35].

The common emphasis in measuring stereoacuity is on the depth (disparity) domain. Disparities of only a few seconds of arc are detectable [36], which classifies stereoscopic vision as a hyperacuity. On the other hand, the human ability to estimate absolute depth is quite poor. The psychophysical experiments described in [37] show that, although the minimum detectable disparity is indeed as small as a few seconds of arc, the disparity increment thresholds are considerably higher than those of visual (spatial) acuity. This is termed the imprecision of stereopsis.

The spatial acuity of stereoscopic vision is seldom examined. Indirect evidence on the issue can be found, for example, in [34] and [38]. The authors of [34] examined displays with more than one object and found a limit on the disparity gradient, defined as the difference in the disparity of two objects divided by their separation in visual angle. Ref. [38] describes an interaction between the depths of adjacent stimuli: Westheimer noticed that when two stimuli are only a few minutes of arc apart, a sort of pooling occurs between their disparities, and the stimuli seem attracted to each other in depth. At larger separations (more than 6') the objects act as if they repelled each other in depth.

Ref. [39] reports an experiment from which the spatial acuity of stereopsis can be estimated quite directly. Tyler's study revealed a limit on the ability of stereoscopic vision to perceive depth in stimuli with details of a grain finer than 3 cycles/deg. While presenting subjects with vertical line stimuli containing sinusoidal disparity variations, Tyler noticed that a sinusoidal curvature of higher spatial frequency than this value was clearly visible monocularly, whereas a stereoscopic image with the same curvature did not elicit depth perception. Thus, within a region of 10' (the distance between a minimum and a maximum of a sinusoidal grating) depth differences were not apparent. Other works [40, 41] that investigate the high spatial frequency limitation of stereopsis describe a limit of 3-5 cycles/deg on the spatial frequencies. In the words of [41], "This is equivalent to saying that the depth image is quite blurred compared to the monocular acuities which can extend beyond 50 cycles/deg".

These limitations of stereopsis do not pose a problem in the processing of natural scenes, since the presence of monocular cues allows other visual mechanisms to be used to determine the object's shape. Only random-dot stereogram (RDS) stimuli can reveal the coarser grain of stereopsis, since in them the shape is not visible monocularly.
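The disparity gradient mentioned in this section is a simple ratio, and a minimal sketch may help fix the definition. The function name, the choice of minutes of arc as the unit, and the use of magnitudes (so that the result is non-negative) are assumptions made here for illustration; the stimulus values are hypothetical.

```python
def disparity_gradient(disparity_a, disparity_b, separation):
    """Disparity difference of two objects divided by their angular separation.

    All three arguments must share one angular unit (e.g. minutes of arc),
    so the result is dimensionless. Taking magnitudes is an assumption.
    """
    if separation == 0:
        raise ValueError("the two objects must be separated in visual angle")
    return abs(disparity_a - disparity_b) / abs(separation)


# Hypothetical stimuli: disparities of 2' and 5', separated by 6'.
gradient = disparity_gradient(2.0, 5.0, 6.0)
print(gradient)  # 0.5
```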

3.2 The area-based assumption

All these studies imply that stereoscopic vision operates in a manner much coarser than monocular vision. The finding that depth cannot be accurately perceived in stimuli with details of a grain finer than 3 cycles/deg can be interpreted as predicting that the spatial acuity of stereopsis is about 10' (the distance between a minimum and a maximum of a sinusoidal grating). This result also agrees with the finding of [36], which suggests that "The mosaic of disparity detection is much coarser than that of feature detection".
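The 10' figure follows from simple arithmetic on the grating frequency. The short derivation below makes the step explicit; the symbols $f$ (spatial frequency), $T$ (period) and $\Delta$ (distance between a minimum and the adjacent maximum) are introduced here for illustration only.

```latex
\begin{align*}
f &= 3\ \text{cycles/deg} \\
T &= \frac{1}{f} = \frac{1}{3}\ \text{deg} = \frac{60'}{3} = 20' \\
\Delta &= \frac{T}{2} = 10'
\end{align*}
```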

The spatial size limit in stereopsis, which is an order of magnitude larger than visual spatial acuity, seems to indicate that stereopsis is an area-based process rather than a point process. In a point process, the disparity is estimated for every "pixel" of the input image, and thus the spatial resolution of the output is the same as that of the input. In an area-based process, a single output value is computed for each whole area of the input, and thus the resolution of the output is lower than that of the input. In our case, this means that when the only available input is pure disparity (i.e., no monocular cues), a single disparity value is computed for every "area" rather than for every "pixel". In general, the size constant of an area process is revealed by the resolution of the process. Since a single output value is assigned to each area, the size of the area is at least the resolution. Thus, a size limit of 8' means that the area involved is at least of this size. In this regard it is interesting to note a recent finding [42], indicating that within a visual area of about 10' only a single coherent motion can be perceived. This might suggest that similar, area-based processing also takes place in motion perception.
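The difference between a point process and an area-based process can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the single value assigned to each area is the mean of the per-pixel disparities inside it; the text does not specify the pooling rule, and the function and variable names are hypothetical.

```python
import numpy as np


def area_process(point_disparity, area):
    """Pool a per-pixel disparity map into one value per area x area block.

    The mean is an assumed pooling rule; any rule giving one value per block
    would show the same loss of output resolution.
    """
    h, w = point_disparity.shape
    h_out, w_out = h // area, w // area
    blocks = point_disparity[:h_out * area, :w_out * area]
    blocks = blocks.reshape(h_out, area, w_out, area)
    return blocks.mean(axis=(1, 3))


# Point process: one disparity per pixel (an 8 x 8 map with a vertical step).
point_map = np.zeros((8, 8))
point_map[:, 4:] = 2.0

# Area process: one disparity per 4 x 4 area, hence a 2 x 2 output.
area_map = area_process(point_map, area=4)
print(area_map.shape)  # (2, 2)
```

The output resolution drops by the area size in each direction, which is the sense in which the resolution of the process reveals its size constant.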

3.3 Algorithms vs. observations

The computational models of human stereoscopic vision have been classified [28] as cooperative, coarse-to-fine, and direct feedforward (see also Section 2). The algorithm by Marr and Poggio [1] is reported here as a representative of the cooperative approach.

Algorithm 1 Marr and Poggio

  $C_0(x, y, d) = 1$ if $x, y, d$ correspond to a match in the original data.
  Until $C$ satisfies some convergence criterion, do
    $$C_{n+1}(x, y, d) \leftarrow T\Big[ \sum_{x', y', d' \in S} C_n(x', y', d') - \sum_{x', y', d' \in O} C_n(x', y', d') + C_0(x, y, d) \Big]$$
    where
    $$T[x] = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{otherwise} \end{cases}$$
    $S$ = set of points $x', y', d'$ such that $|x - x'| \le 1$ and $d = d'$
    $O$ = set of points $x', y', d'$ such that $|x - x'| \le 1$ and $|d - d'| = 1$
  end
end
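The cooperative update can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it follows the two neighbourhoods as stated in Algorithm 1 (same disparity for excitation, adjacent disparities for inhibition, both restricted to $|x - x'| \le 1$). The array layout `[y, x, d]`, the threshold value, the zero padding at the borders and the stopping rule (state unchanged, or an iteration cap) are all assumptions.

```python
import numpy as np


def sum_over_x_neighbours(c):
    """Sum c over x' with |x - x'| <= 1; c is indexed [y, x, d], zero padded."""
    p = np.pad(c, ((0, 0), (1, 1), (0, 0)))
    return p[:, :-2, :] + p[:, 1:-1, :] + p[:, 2:, :]


def cooperative_stereo(c0, threshold=1.5, max_iter=20):
    """Iterate C <- T[excitation - inhibition + C0] until the state is stable."""
    c0 = np.asarray(c0, dtype=float)
    c = c0.copy()
    for _ in range(max_iter):
        excite = sum_over_x_neighbours(c)            # same disparity, d' = d
        p = np.pad(excite, ((0, 0), (0, 0), (1, 1)))
        inhibit = p[:, :, :-2] + p[:, :, 2:]         # adjacent disparities, |d - d'| = 1
        updated = (excite - inhibit + c0 > threshold).astype(float)
        if np.array_equal(updated, c):               # assumed convergence criterion
            break
        c = updated
    return c


# One scanline, six positions, three candidate disparities.
# True matches lie at d = 1; a single false match is planted at (x = 2, d = 0).
c0 = np.zeros((1, 6, 3))
c0[0, :, 1] = 1.0
c0[0, 2, 0] = 1.0
result = cooperative_stereo(c0)
print(result[0].T)  # rows are disparities; only the d = 1 row survives
```

In this toy case the isolated false match receives little excitation and strong inhibition from the consistent layer at the neighbouring disparity, so it is switched off after the first iteration.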

Cooperative models do not agree too well with the "area process" assumption, since the final disparity map they estimate has the same resolution as the input, namely that of visual acuity. Coarse-to-fine models carry out the matching process at different scales, and thus could be regarded as reflecting the process of the coarser channel. As far as direct feedforward models are concerned [5, 6], we report the algorithm proposed in [5]:

Algorithm 2 Yeshurun and Schwartz

  Rectangular patches extracted from the left and right images are butted against each other to form a joint image $I(x, y)$.
  Compute the cepstrum:
    $$C\{I(x, y)\} = \|F\{\log(\|F\{I(x, y)\}\|^2)\}\|^2$$
  where $F\{\cdot\}$ denotes the Fourier transform.
  Compute a disparity field by detecting peaks in $C$.
end

Direct feedforward, area-based models indeed predict that a single depth (disparity) estimate would be associated with each "area patch", and are in full agreement with the area-based assumption.
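The cepstral computation of Algorithm 2 can be sketched with NumPy's FFT routines. This is an illustrative sketch under several assumptions that are not in the text: the patches are butted horizontally, only horizontal disparity is sought (zero vertical quefrency), the joint image has its mean removed and is zero-padded to four patch widths so that the lags $w + d$ and $-(w + d)$ do not alias, and a small `eps` guards the logarithm. The peak is searched at lag $w + d$, where $w$ is the patch width. All names are hypothetical.

```python
import numpy as np


def power_cepstrum(image, eps=1e-9):
    """C{I} = |F{ log(|F{I}|^2) }|^2; eps (an assumption) avoids log(0)."""
    power = np.abs(np.fft.fft2(image)) ** 2
    return np.abs(np.fft.fft2(np.log(power + eps))) ** 2


def cepstral_disparity(left_patch, right_patch, max_disp):
    """Estimate one horizontal disparity for a pair of equally sized patches.

    A positive result means the right patch content is displaced to the
    right with respect to the left patch (sign convention assumed here).
    Requires max_disp < patch width.
    """
    h, w = left_patch.shape
    joint = np.hstack([left_patch, right_patch])      # butted patches
    padded = np.zeros((h, 4 * w))
    padded[:, :2 * w] = joint - joint.mean()          # remove the DC pedestal
    cep = power_cepstrum(padded)
    lags = np.arange(w - max_disp, w + max_disp + 1)  # candidate lags w + d
    peak = lags[np.argmax(cep[0, lags])]              # zero vertical quefrency
    return int(peak - w)


# Two 16 x 32 patches cut from one random texture, offset by 3 pixels.
rng = np.random.default_rng(0)
texture = rng.random((16, 40))
left_patch = texture[:, 4:36]
right_patch = texture[:, 1:33]    # right_patch[:, x] equals left_patch[:, x - 3]
estimate = cepstral_disparity(left_patch, right_patch, max_disp=8)
print(estimate)
```

A single estimate is produced for the whole patch pair, which is the area-based behaviour discussed in the text; a disparity field would be obtained by repeating the computation over a grid of patches.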

4 On Machine Stereo

4.1 The problem

The aim of machine stereo is to reconstruct the 3-D geometry of a scene from two views (left and right) taken by pinhole cameras. With automated systems the problem can be decomposed and formalised more clearly. Two distinct processes are involved: correspondence (matching) and reconstruction. The former estimates which points in the left and right images (called $p_l$ and $p_r$, respectively) are projections of the same scene point $P$; this makes it possible to compute the 2-D displacement vector between $p_l$ and $p_r$, called the disparity of the image point $p_l$ (a similar definition can of course be given in terms of $p_r$). Reconstruction recovers the depth of point $P$, using the estimated disparity and a model of the stereo rig specifying the pose of each camera and its optical parameters. The measurement of the camera model parameters is known as calibration, and is a problem in its own right [43], which will not be addressed in this paper.

Figure 1: Square random-dot stereogram. The right image of the stereogram is computed by warping the left one (left), according to a given disparity pattern (right): the square has a disparity of 10 pixels, the background 3 pixels.

Figure 2: Disparity map computed by SSD correlation for the square random-dot stereogram of Fig. 1, with a 3 × 3 window (left) and a 7 × 7 window (right); the MAE is 0.240 and 0.144, respectively.

Matching can be regarded as a search problem, since for each element of the left image (a point, region, or generic feature), a similar element is to be found in the right one, according to a given similarity measure. To prevent ambiguous (false) matches and avoid combinatorial explosion, the search space must be suitably constrained. Geometric and physical constraints can be imposed both on the stereo system and on the objects in the scene. The epipolar constraint (see for example [43]) states that the candidate matches of a given point lie on a straight (epipolar) line, thus reducing correspondence to a one-dimensional search. Most real scenes satisfy constraints such as: (i) smoothness: the distance of scene points from the cameras changes smoothly almost everywhere (the weak smoothness constraint allows for depth discontinuities, whilst the strong one does not); (ii) uniqueness: each image element has one and only one counterpart; (iii) ordering: order relations holding in the left image are preserved in the right one. Photometric constraints concern lighting conditions and reflectance properties of surfaces. A typical assumption is that the light source is a point at infinity and the surfaces are Lambertian, i.e., the perceived intensity does not depend on the viewing direction, but only on the angle between the incident radiation and the surface normal.

Major problems affecting machine stereo are: (i) occlusions, which leave image elements with no counterpart to be matched with; (ii) photometric distortions, arising when the Lambertian constraint is violated, so that the same world point takes different observed intensities in each view; (iii) figural distortion, which makes the projected shapes different in the left and right images.
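The reconstruction step described in this section depends on the model of the stereo rig. As an illustration only, consider the simplest case, which is an assumption not made in the text: two identical pinhole cameras with parallel optical axes, focal length $f$ (in pixels), baseline $B$, and a purely horizontal disparity $d$ (in pixels). The depth $Z$ of the scene point then follows by similar triangles. The numerical values of $f$ and $B$ below are hypothetical; the two disparities are those of the stereogram in Fig. 1.

```latex
\begin{align*}
Z &= \frac{f\,B}{d} \\
f = 500\ \text{px},\ B = 0.1\ \text{m},\ d = 10\ \text{px}
  &\;\Rightarrow\; Z = \frac{500 \times 0.1}{10} = 5\ \text{m} \\
f = 500\ \text{px},\ B = 0.1\ \text{m},\ d = 3\ \text{px}
  &\;\Rightarrow\; Z = \frac{500 \times 0.1}{3} \approx 16.7\ \text{m}
\end{align*}
```

Under this model a larger disparity corresponds to a nearer surface, so the square of Fig. 1 would lie in front of the background.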

4.2 Experiments with area-based techniques

The experience gathered with machine stereo algorithms clusters around the distinction between area-based and feature-based algorithms. The former yield dense depth maps, and so are preferable when 3D shape recovery is of concern, but are computationally more expensive; the latter are robust and more efficient, but yield sparse disparity maps that must be further interpolated. We report experiments with area-based algorithms. Random-dot stereo pairs, a widely adopted testbed in computational stereo, have been used.
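A square random-dot stereogram of the kind described in the caption of Fig. 1 can be generated as sketched below. The disparities (10 pixels for the square, 3 for the background) come from that caption; the image size, the position of the square, the binary dots, the random filling of uncovered pixels and the sign convention (a left pixel at column $x$ with disparity $d$ appears at column $x - d$ in the right image) are assumptions made for this sketch.

```python
import numpy as np


def make_square_rds(size=64, square=(16, 48), d_square=10, d_background=3, seed=0):
    """Return (left, right, disparity) for a square random-dot stereogram."""
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, size=(size, size)).astype(float)

    # Disparity pattern: a central square in front of a flat background.
    disparity = np.full((size, size), d_background, dtype=int)
    lo, hi = square
    disparity[lo:hi, lo:hi] = d_square

    # Right-image pixels not covered by any warped dot keep fresh random dots.
    right = rng.integers(0, 2, size=(size, size)).astype(float)

    # Warp the left image: the far surface first, the near surface last,
    # so that the square overwrites (occludes) the background.
    ys, xs = np.indices((size, size))
    for d in sorted({d_background, d_square}):
        visible = (disparity == d) & (xs - d >= 0)
        right[ys[visible], xs[visible] - d] = left[visible]
    return left, right, disparity


left, right, disparity = make_square_rds()
print(left.shape, right.shape, disparity.min(), disparity.max())
```

Neither image shows the square monocularly; it is defined only by the relative shift of the dots between the two views.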

It should be emphasised that the term "area-based" is given a slightly different meaning in human and in machine stereo. In the former domain (Section 3.2) it denotes the mapping of an input area onto a single output disparity estimate; in machine stereo it indicates a class of techniques that estimate disparity, even of individual pixels, by means of neighbouring pixels (a window). Area-based algorithms in the former sense arise in machine stereo as special cases, for example when coarse-to-fine (multi-scale) techniques are adopted [44, 4, 11]. We also remark that for area-based algorithms to be applicable, the following constraints should hold: (i) surfaces are textured; (ii) surfaces are Lambertian; (iii) figural distortion is negligible. The basic Sum of Squared Differences (SSD) correlation algorithm is reported here.

Algorithm 3 Basic SSD

  let $I_r$, $I_l$ be the right and left $N \times N$ images, respectively;
  let $W$ be an $n \times n$ window (with $n \ll N$);
  for each pixel $I_l(x, y)$
    for each disparity $d = (d_x, d_y)$ in some range
      compute
      $$C(x, y, d) = \sum_{(\xi, \eta) \in W} [I_l(x + \xi, y + \eta) - I_r(x + \xi - d_x, y + \eta - d_y)]^2$$
    end
    $d_l(x, y) \leftarrow \arg\min_d C(x, y, d)$
  end
end

The estimated disparity is the one that minimises the Euclidean distance (maximises the similarity) between the right and left areas. The asymptotic complexity is $O(n^2 N^2)$. The accuracy is at the pixel level, but subpixel precision can be achieved [8]. In Fig. 2 we show the results of applying the Basic SSD algorithm to a random-dot stereo pair. The Mean Absolute Error (MAE) is estimated as the mean of the absolute differences between computed and true disparities. Even under simplified conditions, the choice of the window size appears to be critical (Fig. 2). A window that is too small is sensitive to noise, whereas an excessively large one acts as a low-pass filter and is likely to miss depth discontinuities.
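The Basic SSD algorithm and the MAE measure can be sketched as follows. The sketch assumes a rectified pair, so that only horizontal, non-negative integer disparities are searched ($d_y = 0$), an odd window size, and zero padding at the image borders; these choices and all names are illustrative and not taken from the experiments reported here.

```python
import numpy as np


def box_sum(img, n):
    """Sum of img over an n x n window centred on each pixel (n odd, zero padded)."""
    r = n // 2
    p = np.pad(img, r)
    out = np.zeros(img.shape)
    for dy in range(n):
        for dx in range(n):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out


def ssd_disparity(left, right, max_disp, n=7):
    """For each left pixel, the d in [0, max_disp] minimising the windowed SSD."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        sq = np.zeros((h, w))
        # Squared difference between I_l(x, y) and I_r(x - d, y).
        sq[:, d:] = (left[:, d:] - right[:, :w - d]) ** 2
        cost[d, :, d:] = box_sum(sq, n)[:, d:]
    return np.argmin(cost, axis=0)


def mean_absolute_error(computed, true):
    """Mean of the absolute differences between computed and true disparities."""
    return float(np.mean(np.abs(computed - true)))


# Synthetic pair: the right image is the left one shifted by 3 pixels.
rng = np.random.default_rng(1)
left = rng.random((20, 40))
right = rng.random((20, 40))
right[:, :-3] = left[:, 3:]

disp = ssd_disparity(left, right, max_disp=6, n=5)
error = mean_absolute_error(disp[:, 3:], np.full((20, 37), 3))
print(error)  # 0.0
```

The first three columns are excluded from the error because their true match falls outside the right image. On a stereogram with a depth step, as in Fig. 1, the same code can be run with different values of `n` to observe the trade-off between noise sensitivity and blurring of discontinuities.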

4.3 Some steps further: treating the occlusions

Naive SSD techniques can handle neither depth discontinuities nor occlusions. The former problem is addressed effectively, although not efficiently, by the Adaptive Window algorithm [11], but occlusions still require ad hoc solutions.

Two key observations help to address the occlusion problem: (i) matching is not a symmetric process: when searching for corresponding elements, only the points visible in one image are matched; (ii) in many real cases a depth discontinuity in one image corresponds to an occlusion in the other. Some authors [4, 3] use observation (i) to validate matching (left-right consistency); others [12, 2] use (ii) to constrain the search space. Recently, a new algorithm has been proposed [45] that computes disparity by exploiting the left-right consistency constraint. For each pixel, correlation is performed with nine different 7 × 7 windows, and the disparity with the smallest SSD error is retained. Occlusions are also detected, by checking left-right consistency and suppressing infeasible matches accordingly. The algorithm is reported below:

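Algorithm 4 Symmetry Based Stereo (SBS)

  for all $(x, y)$ in the left image $I_l$ do
    for all windows $w$ do
      $d_{l,w}(\mathbf{x}) \leftarrow \arg\min_d C(x, y, d, I_l, I_r, w)$
      $d_{r,w}(\mathbf{x}) \leftarrow \arg\min_d C(x, y, d, I_r, I_l, w)$
    end
    $\sigma_d^2(\mathbf{x}) = \frac{1}{N - 1} \sum_{w=1}^{N} (d_{l,w}(\mathbf{x}) - \bar{d}_{l,w}(\mathbf{x}))^2$
    $d_l(\mathbf{x}) \leftarrow \arg\min_w C(x, y, d_{l,w}, I_l, I_r, w)$
    $d_r(\mathbf{x}) \leftarrow \arg\min_w C(x, y, d_{r,w}, I_r, I_l, w)$
    $d(\mathbf{x}) \leftarrow d_l(\mathbf{x})$ + subpixel
  end
  for all $(x, y)$ in $I_l$ do
    if $d_l(\mathbf{x}) \ne -d_r(\mathbf{x} + d_l(\mathbf{x}))$ then $\sigma_d^2(\mathbf{x}) \leftarrow +\infty$
  end
end

Here $\mathbf{x} = (x, y)$, $N$ is the number of windows, and $\bar{d}_{l,w}(\mathbf{x})$ denotes the mean of $d_{l,w}(\mathbf{x})$ over the windows.

The SBS algorithm has been applied to the square random-dot stereograms of Fig. 1. Fig. 3 and Fig. 4 show the disparity maps computed by SBS and the estimated uncertainty maps (the darker, the lower) in both cases. The estimated MAE is negligible and may be ascribed to the subpixel accuracy only. The occluded points, shown in white in the uncertainty maps, are recovered with 100% accuracy in both cases.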
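The left-right consistency test at the end of Algorithm 4 can be sketched on its own, independently of how the two disparity maps were obtained. The sketch follows the sign convention of the algorithm as written: a left pixel at column $x$ is matched to column $x + d_l(x)$ of the right image, and the match is accepted only if $d_l(x) = -d_r(x + d_l(x))$. Treating matches that fall outside the image as occluded, the horizontal-only disparities and all names are assumptions made for this illustration.

```python
import numpy as np


def left_right_occlusions(d_left, d_right):
    """Boolean mask of left pixels failing d_l(x) == -d_r(x + d_l(x))."""
    rows, cols = np.indices(d_left.shape)
    target = cols + d_left                        # matched column in the right image
    inside = (target >= 0) & (target < d_left.shape[1])
    back = np.zeros_like(d_right)
    back[inside] = d_right[rows[inside], target[inside]]
    # Matches leaving the image are treated as occluded (an assumption).
    return ~inside | (d_left != -back)


# Toy scanline: every left pixel claims disparity -1, except one outlier (-2);
# every right pixel claims disparity +1.
d_left = np.array([[-1, -1, -1, -2, -1, -1]])
d_right = np.ones((1, 6), dtype=int)

occluded = left_right_occlusions(d_left, d_right)
print(occluded)  # [[ True False False  True False False]]

# As in Algorithm 4, inconsistent pixels get infinite uncertainty.
variance = np.zeros(d_left.shape)
variance[occluded] = np.inf
```

The first pixel is flagged because its match falls outside the right image, and the outlier is flagged because the right image does not map back to it.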

Figure 3: Disparity map computed by SBS for the square random-dot stereogram (left) and its uncertainty (right). The MAE is 0.019.

Figure 4: Left: isometric plot of the disparity map in Fig. 2, left. Right: isometric plot of the disparity map in Fig. 3.

5 Conclusions

The computational techniques adopted in both human and machine stereo have been reviewed, as well as some of the results arising from experimental observations. Further links emerge between the human and the artificial sides of stereo fusion. Evidence from psychophysics seems to point towards area-based mechanisms underlying both depth and motion perception in humans. On the other hand, area-based algorithms have been proposed in the domain of machine stereo to provide robust, dense depth maps for object and motion reconstruction. Although the term "area-based" takes a slightly different meaning in the two domains, it is still interesting to note that robust fusion mechanisms, involving an image point and its neighbourhood, are required in both.

Open problems in machine stereo concern the treatment of occlusions and depth discontinuities: the SBS algorithm takes some steps forward, but the spatial scale (i.e., the window size) is still a free parameter. A coarse-to-fine strategy would add a flexible, scale-invariant mechanism to adapt the window size to the local image texture and disparity fields. Moreover, once the occlusions have been segmented, a relevant issue is how to exploit this valuable piece of information in the depth reconstruction process. Once more in the history of stereovision, a cross-disciplinary research effort appears promising.

References

[1] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 194:283-287, 1976.
[2] I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3):542-567, 1996.
[3] O. Faugeras, B. Hotz, H. Mathieu, T. Vieville, Z. Zhang, P. Fua, E. Theron, L. Moll, G. Berry, J. Vuillemin, P. Bertin, and C. Proy. Real-time correlation-based stereo: algorithm, implementation and applications. Technical Report 2013, Unité de Recherche INRIA Sophia-Antipolis, 1993.
[4] P. Fua. Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities. In Proceedings of the International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991.
[5] Y. Yeshurun and E. L. Schwartz. Cepstral filtering on a columnar image architecture: a fast algorithm for binocular stereo segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:759-767, 1989.
[6] K. O. Ludwig, H. Neumann, and B. Neumann. Local stereoscopic depth estimation. Image and Vision Computing, 12:16-35, 1994.
[7] M. J. Hannah. A system for digital stereo image matching. Photogrammetric Engineering and Remote Sensing, pages 1765-1770, 1989.
[8] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283-310, 1989.
[9] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353-363, 1993.
[10] S. S. Intille and A. F. Bobick. Disparity-space images and large occlusion stereo. In Jan-Olof Eklundh, editor, European Conference on Computer Vision, pages 179-186, Stockholm, Sweden, 1994. Springer-Verlag.
[11] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):920-932, 1994.
[12] D. Geiger, B. Ladendorf, and A. Yuille. Occlusions and binocular stereo. International Journal of Computer Vision, 14:211-226, 1995.
[13] H. K. Nishihara. PRISM, a practical real-time imaging stereo matcher. A.I. Memo 780, Massachusetts Institute of Technology, 1984.
[14] D. G. Jones and J. Malik. Computational framework for determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699-708, 1992.
[15] J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(8):806-825, 1992.
[16] D. Marr and T. Poggio. A theory of human stereo vision. A.I. Memo 451, Massachusetts Institute of Technology, November 1977.
[17] W. E. L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1):17-34, 1985.
[18] S. B. Pollard, J. E. W. Mayhew, and J. P. Frisby. PMF: A stereo correspondence algorithm using a disparity gradient constraint. Perception, 14:449-470, 1985.
[19] H. H. Baker and T. O. Binford. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[20] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139-154, 1985.
[21] G. Medioni and R. Nevatia. Segment-based stereo matching. Computer Vision, Graphics, and Image Processing, 31:2-18, 1985.
[22] S. T. Barnard and W. B. Thompson. Disparity analysis of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(4):333-340, 1980.
[23] M. R. M. Jenkin, A. D. Jepson, and J. K. Tsotsos. Techniques for disparity measurements. CVGIP: Image Understanding, 53(1):14-30, 1991.
[24] M. R. M. Jenkin and A. D. Jepson. Recovering local surface structure through local phase difference measurements. CVGIP: Image Understanding, 59(1):72-93, 1994.
[25] R. D. Henkel. Hierarchical calculation of 3d-structure. Technical Report 5/94, Zentrum für Kognitionswissenschaften, Universität Bremen, 1994.
[26] P. N. Belhumeur. A bayesian approach to binocular stereopsis. International Journal of Computer Vision, 19(3):237-260, 1996.
[27] C. Tomasi and R. Manduchi. Stereo without search. In European Conference on Computer Vision, pages 452-465, 1996.
[28] R. Blake and H. R. Wilson. Neural models of stereoscopic vision. Trends in Neuroscience, 14:445-452, 1991.
[29] R. C. Bolles, H. H. Baker, and M. J. Hannah. The JISCT stereo evaluation. Technical report, SRI International, January 1993.
[30] U. R. Dhond and J. K. Aggarwal. Structure from stereo - a review. IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489-1510, 1989.
[31] K. N. Ogle. Researches in Binocular Vision. Saunders, Philadelphia, 1950.
[32] K. N. Ogle. Disparity limits of stereopsis. Archives of Ophthalmology, 48:50-60, 1952.
[33] J. P. Frisby and J. E. W. Mayhew. Contrast sensitivity function for stereopsis. Perception, 7:423-429, 1978.
[34] D. L. Halpern and R. Blake. How contrast affects stereoacuity. Perception, 17:483-495, 1988.
[35] A. Arditi. Binocular vision. In K. R. Boff, L. Kaufman, and J. P. Thomas, editors, Handbook of perception and performance: Vol. 1., Sensory processes and perception, pages 23.1-23.41. Wiley, New York, 1986.
[36] G. Westheimer. The Ferrier lecture, 1992. Seeing depth with two eyes: stereopsis. Proc. R. Soc. Lond. B, 257:205-214, 1994.
[37] S. P. McKee, D. M. Levi, and S. F. Bowne. The imprecision of stereopsis. Vision Research, 30:1763-1779, 1990.
[38] G. Westheimer. Spatial interaction in the domain of disparity signals in human stereoscopic vision. Journal of Physiology, 370:619-629, 1986.
[39] C. W. Tyler. Stereoscopic vision: Cortical limitations and a disparity scaling effect. Science, 181:276-278, 1973.
[40] C. W. Tyler. Depth perception in disparity gratings. Nature, 251:140-142, 1974.
[41] C. W. Tyler and B. Julesz. On the depth of the cyclopean retina. Experimental Brain Research, 40:196-202, 1980.
[42] Y. Hermush and Y. Yeshurun. Spatial gradient limit on perception of multiple motion. Perception, 24:1247-1256, 1995.
[43] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. The MIT Press, Cambridge, 1993.
[44] G. Xu, S. Tsui, and M. Asada. Coarse-to-fine strategy for matching motion stereo pairs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 892-894, 1985.
[45] A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with multiple windowing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997. IEEE Computer Society Press. To appear.

13


1 Introduction

Binocular stereo fusion uses the slightly different views of a scene, projected onto the right and left retinas, to recover depth information. From a generic point of view, in both humans and machines the problem reduces to matching the two views, in order to find the displacement (disparity) of corresponding patterns in the projected images. Stereo fusion has been intensively investigated, and a number of computational models have been proposed in human vision (to explain psychophysical and neurophysiological data) and in machine vision (to solve problems such as reconstructing 3D shapes and understanding motion and moving shapes).

Historically, the computational approach has provided a unified framework for building models, simulating mechanisms, and quantitatively comparing measurements with predictions. In fact, a number of biologically plausible algorithms have been proposed in the past to address machine vision tasks. However, the studies in human and machine stereo have rapidly progressed along distinct paths and with basically different purposes, so that biological plausibility is now a minor concern for researchers in machine vision. Nevertheless, the human visual system is still a source of extremely helpful ideas; conversely, researchers in human vision benefit from new computational structures and algorithms. For these reasons, it is still worthwhile reviewing some of the achievements in the two domains, by exploiting the common language offered by the computational approach.

This paper provides a review of the subject, with emphasis on some recent results and open problems in both domains. In particular, Section 2 presents an introductory review of computational stereo. Some main observations concerning human stereovision are presented in Section 3, and the algorithms are discussed accordingly. A similar review is reported in Section 4 for machine stereo, with emphasis on the area-based techniques, their advantages and drawbacks. Section 5 contains our conclusions.

2 Computational Stereo: a View from the Bridge

We group together all the relevant solutions proposed to the stereo matching problem, derived from both the human and the machine vision literature. The techniques adopted can be classified along two dimensions: the kind of image element considered for matching (What to match), and the techniques used to compute the matching (How to match). In addition, one can be interested in the computational schemes adopted, especially when biological plausibility is of concern.

Let us address the first issue. Some algorithms [1, 2] match individual pixels, i.e., the atomic elements in an image. More robust methods perform matching between gray levels of image patches (windows), by computing the disparity for every pixel [3, 4], for the centers of the windows [5, 6], or for selected points of interest [7]. These window-based methods are also called area-based. Although several correlation measures have been proposed, the Sum of Squared Differences (SSD) measure is regarded as a reasonable choice [8, 4, 3, 9, 10, 11, 12]. Some of the problems encountered when matching raw intensities, which arise from the fact that the gray levels are not identical in the two images, may be overcome by considering the output of a bandpass filter, usually a LoG filter [13]. One could also compute the response of a bank of filters at a given image point, which defines a vector characterising the local structure of the image [14, 15]. A similar vector is estimated on the other image, in order to compute the matching.

Matching image features is generally more robust; the related class of algorithms is called feature-based. In the present context, the term "features" indicates physically meaningful cues, such as edges [16, 17, 18, 19, 20], segments (collinear connected edges) [21], and corners (where two edges cross) [22]. Features can be extracted by bandpass filters, derivative operators or ad-hoc non-linear operators. The local phase of the image signal, computed via Fourier or Gabor transforms, has also been matched [23, 24, 25]. Since the disparity should be less than one pixel to avoid aliasing (according to the sampling theorem, or the "quarter cycle limit" [1]), a multiresolution scheme should be employed.

We now come to the second question: How to perform matching? Correlation techniques consist of finding the amount of shift that yields the maximum similarity score between the left and the right elements [8, 4, 3, 9, 10, 11, 13]. With relaxation-based methods the elements are joined by weighted links; initial weights are iteratively updated by propagating constraints, until some equilibrium configuration is reached [1, 16, 17, 18, 22]. Dynamic programming techniques adopt a cost function, which embeds the constraints and is minimised to get the best set of matches [20, 19, 10, 12, 26, 2]. The solution is a path in the match space [2, 12] or in the disparity space [10]. Usually, the energy functional is derived using Bayesian reasoning [12, 26, 2]. A novel approach to matching consists of representing image scanlines by means of intrinsic curves [27], i.e., the paths followed by a descriptor vector as the scanline is traversed from left to right. Intrinsic curves are invariant to image displacements, and this property is exploited to compute the matching.

As far as the computational scheme is concerned, algorithms can be classified into cooperative, coarse-to-fine and feedforward (see [28] for more details). Cooperative models, pioneered by [1], exploit the properties of recurrent nets, which perform relaxation to a minimum energy configuration. In coarse-to-fine models, the disparities computed at different spatial scales are fused to compute the final disparity estimate. In biological vision, coarse-to-fine models identify a special class of algorithms using multiple spatial filters that simulate receptive fields [16, 13]. In machine vision, this paradigm is applicable to every scheme, in order to get scale independence and data redundancy. It is mandatory only with phase-based methods. While the cooperative and the coarse-to-fine techniques require cooperative feedback or sequential disparity processing over the spatial scales, the feedforward scheme [5] operates in one shot, like most of the machine stereo algorithms. For further details on machine stereo, the reader may consult the recent surveys in Refs. [29, 30]; a review of human computational stereo is reported in [28].
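To make the coarse-to-fine idea concrete, the following is a minimal sketch, not taken from any of the cited systems. It assumes a one-dimensional scanline, circular borders, pair averaging as the coarser scale, and the hypothetical helper names `ssd`, `halve` and `coarse_to_fine`. The disparity is first estimated at half resolution and then refined within one sample at full resolution.

```python
import random

def ssd(left, right, x, d, half):
    """SSD between the window around left[x] and the one around right[x - d]
    (circular indexing is an assumption made to avoid border handling)."""
    n = len(left)
    return sum((left[(x + k) % n] - right[(x + k - d) % n]) ** 2
               for k in range(-half, half + 1))

def best_disparity(left, right, x, candidates, half=2):
    """Candidate disparity with the smallest SSD score."""
    return min(candidates, key=lambda d: ssd(left, right, x, d, half))

def halve(signal):
    """Coarser scale: average adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]

def coarse_to_fine(left, right, x, max_coarse=4):
    """Estimate at half resolution, then refine within +/-1 sample at full resolution."""
    coarse = best_disparity(halve(left), halve(right), x // 2, range(max_coarse + 1))
    guess = 2 * coarse
    return best_disparity(left, right, x, [guess - 1, guess, guess + 1])

random.seed(0)
left = [random.random() for _ in range(64)]
true_d = 4
right = [left[(x + true_d) % 64] for x in range(64)]  # so that right[x - d] == left[x]
estimates = [coarse_to_fine(left, right, x) for x in range(64)]
```

The fine-scale search examines only three candidates per sample, which is the practical appeal of the scheme; an error at the coarse scale, however, cannot be recovered by such a narrow refinement.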

3 Issues in Human Stereo Fusion

3.1 Observables and measures

Binocular stereopsis is one of the most intensively investigated cues, both algorithmically and psychophysically, among the many cues which humans use for segmenting a three-dimensional scene. Many of the proposed models are based on a direct search for matching features in the left and right images. A frequently examined parameter is stereoacuity (stereo depth acuity), determined by the disparity threshold (the smallest disparity that yields correct depth perception). Ogle [31, 32] examined Panum's fusional area, which can be expressed as a range of disparities in which a stereoscopically presented object appears fused and single. Another variable that has been examined is contrast sensitivity [33, 34, 35].

The common emphasis, in measuring stereoacuity, is on the depth (disparity) domain. Disparities of only a few seconds of arc are detectable [36], which classifies stereoscopic vision as a hyperacuity. On the other hand, the human ability to estimate absolute depth is quite poor. The psychophysical experiments described in [37] show that, although the minimum detectable disparity is indeed as small as a few seconds of arc, the disparity increment thresholds are considerably higher than those of visual (spatial) acuity. This is termed the imprecision of stereopsis.

The spatial acuity of stereoscopic vision is seldom examined. Indirect evidence on the issue of spatial acuity can be found, for example, in [34] and [38]. The authors of [34] examined displays with more than one object and found a limit on the disparity gradient, defined as the difference in the disparity of two objects divided by their separation in visual angle. Ref. [38] describes an interaction between the depths of adjacent stimuli. Westheimer noticed that when two stimuli are only a few minutes of arc apart, a sort of pooling occurs between their two disparities, and the stimuli seem attracted to each other (in depth). At larger distances between the objects (more than 6') the objects act as if they repelled each other in depth.

Ref. [39] reports an experiment from which the spatial acuity of stereopsis can be estimated quite directly. Tyler's study revealed a limit on the ability of stereoscopic vision to perceive depth in stimuli with details of a grain finer than 3 cycles/deg. While presenting subjects with vertical line stimuli containing sinusoidal disparity variations, Tyler noticed that a sinusoidal curvature higher than this value was clearly visible monocularly, whereas a stereoscopic image with the same curvature did not elicit depth perception. Thus, within a region of 10' (the distance between a minimum and a maximum of a sinusoidal grating) depth differences were not apparent. Other works [40, 41] which investigate the high spatial frequency limitation of stereopsis describe a limit of 3-5 cycles/deg on the spatial frequencies. In the words of [41], "This is equivalent to saying that the depth image is quite blurred compared to the monocular acuities which can extend beyond 50 cycles/deg". These limitations of stereopsis do not pose a problem in the processing of natural scenes, since the presence of monocular cues allows the use of other visual mechanisms to determine the object's shape. Only the use of random-dot stereogram (RDS) stimuli can reveal the coarser grain of stereopsis, since the shape is not visible monocularly.
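The 10' figure quoted above follows from the 3 cycles/deg limit by simple unit conversion. A worked version of this arithmetic, using only the fact that one degree equals 60 minutes of arc, is:

```latex
\[
\text{period} = \frac{60'}{3\ \text{cycles}} = 20' \text{ per cycle},
\qquad
\text{minimum-to-maximum distance} = \frac{\text{period}}{2} = 10' .
\]
```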

3.2 The area-based assumption

All these studies imply that stereoscopic vision operates in a manner much coarser than monocular vision. These findings, namely that depth cannot be accurately perceived in stimuli with details of a grain finer than 3 cycles/deg, can be interpreted as predicting that the spatial acuity of stereopsis is about 10' (the distance between a minimum and a maximum of a sinusoidal grating). This result also agrees with the finding of [36], who suggests that "The mosaic of disparity detection is much coarser than that of feature detection".

The spatial size limit in stereopsis, which is an order of magnitude larger than visual spatial acuity, seems to indicate that stereopsis is an area-based process, rather than a point process. In a point process, the disparity is estimated for every "pixel" in the input image, and thus the spatial resolution of the output is the same as the spatial resolution of the input. In an area-based process, a single output value is computed only for whole areas of the input, and thus the resolution of the output is lower than the resolution of the input. In our case, it means that when the only available input is pure disparity (i.e., with no monocular cues), a single disparity value is computed for every "area" rather than for every "pixel". In general, the size constant of an area process is revealed by the resolution of the process. Since a single output value is assigned to each area, the size of the area is at least the resolution. Thus, a size limit of 8' means that the area involved is at least of this size. In this regard it is interesting to note a recent finding [42], indicating that within a visual area of about 10', only a single coherent motion can be perceived. This might suggest that similar, area-based processing also takes place in motion perception.
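The difference between a point process and an area process can be illustrated with a toy sketch. The function names are hypothetical, and the use of the mean as the single value assigned to each area is an assumption made only for illustration; the text above does not specify how the value is formed.

```python
def point_process(disparities):
    """One output value per input sample: output resolution equals input resolution."""
    return list(disparities)

def area_process(disparities, size):
    """One output value (here the mean, an illustrative choice) per block of `size` samples."""
    blocks = [disparities[i:i + size] for i in range(0, len(disparities), size)]
    return [sum(block) / len(block) for block in blocks]

row = [1, 1, 3, 3, 5, 5]
per_pixel = point_process(row)   # six values, one per "pixel"
per_area = area_process(row, 2)  # three values, one per "area"
```

The output of the area process has one third of the samples of its input, which is the sense in which its resolution is bounded by the size of the area.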

3.3 Algorithms vs. observations

The computational models of human stereoscopic vision have been classified [28] as Cooperative, Coarse-to-fine, and Direct feedforward (see also Section 2). The algorithm by Marr and Poggio [1] is reported here as a representative of the cooperative approach.

Algorithm 1 Marr and Poggio

$C_0(x, y, d) = 1$ if $(x, y, d)$ corresponds to a match in the original data, and $0$ otherwise.
Until $C$ satisfies some convergence criterion, do
    $C_{n+1}(x, y, d) \leftarrow T\Big[ \sum_{(x', y', d') \in S_1} C_n(x', y', d') - \sum_{(x', y', d') \in S_2} C_n(x', y', d') + C_0(x, y, d) \Big]$
end
where
    $T[x] = 1$ if $x > t$, and $0$ otherwise;
    $S_1$ = set of points $(x', y', d')$ such that $|x - x'| \le 1$ and $d = d'$;
    $S_2$ = set of points $(x', y', d')$ such that $|x - x'| \le 1$ and $|d - d'| = 1$.
end
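The update rule of Algorithm 1 can be sketched in a few lines. This is a simplified illustration rather than the original implementation: it assumes a single scanline (the $y$ coordinate is dropped), a network stored as `c[d][x]`, a threshold `t = 1` chosen for the example, and the hypothetical function names `cooperative_step` and `cooperate`.

```python
def cooperative_step(c_n, c_0, t):
    """One update of the network; c[d][x] is 1 for an active match, else 0."""
    n_d, n_x = len(c_n), len(c_n[0])
    nxt = [[0] * n_x for _ in range(n_d)]
    for d in range(n_d):
        for x in range(n_x):
            xs = [x2 for x2 in (x - 1, x, x + 1) if 0 <= x2 < n_x]
            # First sum: neighbours with |x - x'| <= 1 at the same disparity.
            support = sum(c_n[d][x2] for x2 in xs)
            # Second sum: neighbours with |x - x'| <= 1 at adjacent disparities.
            inhibition = sum(c_n[d2][x2] for d2 in (d - 1, d + 1)
                             if 0 <= d2 < n_d for x2 in xs)
            total = support - inhibition + c_0[d][x]
            nxt[d][x] = 1 if total > t else 0
    return nxt

def cooperate(c_0, t=1, max_iter=20):
    """Iterate until the network stops changing (the convergence criterion assumed here)."""
    c = c_0
    for _ in range(max_iter):
        nxt = cooperative_step(c, c_0, t)
        if nxt == c:
            break
        c = nxt
    return c

# Initial matches: a consistent row at disparity index 1 plus one spurious match.
noisy = [[0, 1, 0],
         [1, 1, 1],
         [0, 0, 0]]
result = cooperate(noisy)
```

In this small example the spurious match at disparity index 0 is suppressed by the neighbouring matches at index 1, while the consistent row supports itself and survives.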

Cooperative models do not agree too well with the "area process" assumption, since the final disparity map they estimate has the same resolution as the input, namely, visual acuity. Coarse-to-fine models carry out the matching process at different scales, and thus could be regarded as reflecting the process of the coarser channel. As far as Direct feedforward models are concerned [5, 6], we report the algorithm proposed in [5]:

Algorithm 2 Yeshurun and Schwartz

Rectangular patches extracted from the left and right images are butted against each other to form a joint image $I(x, y)$.
Compute the cepstrum:
    $C\{I(x, y)\} = \| \mathcal{F}\{ \log( \| \mathcal{F}\{I(x, y)\} \|^2 ) \} \|^2$
where $\mathcal{F}\{\cdot\}$ denotes the Fourier transform.
Compute a disparity field by detecting peaks in $C$.
end

Direct feedforward, area-based models indeed predict that a single depth (disparity) estimate would be associated with each "area patch", and are in full agreement with the area-based assumption.
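The reason why a peak in the cepstrum reveals the disparity can be seen in a one-dimensional sketch. The code below is an illustration under strong assumptions, not the implementation of [5]: the patches are single scanlines, the texture is zero-mean Gaussian noise padded with zeros so that both copies lie entirely inside their patches, the transform is a naive DFT, and all function names are hypothetical. The joint signal then contains the texture twice, separated by the patch width plus the disparity, and this echo produces a cepstral peak at that separation.

```python
import cmath
import math
import random

def dft(values):
    """Naive O(n^2) discrete Fourier transform (adequate for a short scanline)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(values))
            for k in range(n)]

def power_cepstrum(signal, eps=1e-12):
    """|F{ log |F{signal}|^2 }|^2; eps guards against log(0)."""
    log_power = [math.log(abs(c) ** 2 + eps) for c in dft(signal)]
    return [abs(c) ** 2 for c in dft(log_power)]

def cepstral_disparity(left_patch, right_patch, max_d):
    """Butt the patches together and look for the echo peak at (patch width + d)."""
    width = len(left_patch)
    cep = power_cepstrum(left_patch + right_patch)
    return max(range(max_d + 1), key=lambda d: cep[width + d])

random.seed(1)
width, true_d = 63, 5
texture = [random.gauss(0.0, 1.0) for _ in range(47)]
left_patch = texture + [0.0] * (width - len(texture))
right_patch = [0.0] * true_d + texture + [0.0] * (width - len(texture) - true_d)
estimate = cepstral_disparity(left_patch, right_patch, max_d=12)
```

A single estimate is produced for the whole pair of patches, which is the behaviour the area-based assumption predicts; the restriction of the peak search to a small range of disparities is a design choice of this sketch.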

4 On Machine Stereo

4.1 The problem

The aim of machine stereo is to reconstruct the 3-D geometry of a scene from two views (left and right), taken by pinhole cameras. With automated systems the problem can be decomposed and formalised more clearly. Two distinct processes are involved: correspondence (matching) and reconstruction. The former estimates which points in the left and right images (called $p_l$ and $p_r$, respectively) are projections of the same scene point $P$; it allows the computation of the 2-D displacement vector between $p_l$ and $p_r$, called the disparity of the image point $p_l$ (a similar definition can be given in terms of $p_r$, of course). Reconstruction recovers the depth of point $P$, using the estimated disparity and a model of the stereo rig specifying the pose of each camera and its optical parameters. The measurement of the camera model parameters is known as calibration, and is a problem on its own [43], which will not be addressed in this paper.

Matching can be regarded as a search problem, since for each element in the left image (a point, region, or generic feature), a similar element is to be found in the right one, according to a given similarity measure. To prevent ambiguous (false) matches and avoid combinatorial explosion, the search space must be suitably constrained. Geometric and physical constraints can be put both on the stereo system and on the objects in the scene. The epipolar constraint (see for example [43]) states that candidate matches of a given point lie on a straight (epipolar) line, thus reducing correspondence to a one-dimensional search. Most real scenes satisfy constraints such as: (i) smoothness: the distance of scene points from the cameras changes smoothly almost everywhere (the weak smoothness constraint allows for depth discontinuities, whilst the strong one does not); (ii) uniqueness: each image element has one and only one counterpart; (iii) ordering: order relations holding in the left image are preserved in the right one. Photometric constraints concern lighting conditions and reflectance properties of surfaces. A typical assumption is that the light source is a point at infinity and the surfaces are lambertian, i.e., the perceived intensity does not depend on the viewing direction, but only on the angle between the incident radiation and the surface normal.

Major problems affecting machine stereo are (i) occlusions, which leave image elements with no counterpart to be matched with; (ii) photometric distortions, arising when the lambertian constraint is violated, so that the same world point takes different observed intensities in each view; (iii) figural distortion, which makes the projected shapes different in the left and right images.

Figure 1: Square random-dot stereogram. The right image of the stereogram is computed by warping the left one (left), according to a given disparity pattern (right): the square has a disparity of 10 pixels, the background of 3 pixels.

Figure 2: Disparity map computed by SSD correlation for the square random-dot stereogram in Fig. 1 with a $3 \times 3$ window (left) and a $7 \times 7$ window (right); MAE is 0.240 and 0.144, respectively.
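As an illustration of the reconstruction step, the sketch below uses the simplest possible rig model, which is an assumption not stated in the text: two identical pinhole cameras with parallel optical axes (a rectified pair), focal length `focal_px` in pixels and baseline `baseline_m` in metres. Under these assumptions depth is inversely proportional to disparity, $Z = f B / d$. The numerical values are invented for the example.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified pair of identical pinhole cameras."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700-pixel focal length, 10 cm baseline.
z_square = depth_from_disparity(10, focal_px=700, baseline_m=0.1)
z_background = depth_from_disparity(3, focal_px=700, baseline_m=0.1)
```

With these made-up parameters, a region with a disparity of 10 pixels lies closer to the cameras than one with a disparity of 3 pixels, since larger disparities correspond to smaller depths.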

4.2 Experiments with area-based techniques

Experience with machine stereo algorithms clusters around the distinction between area-based and feature-based algorithms. The former yield dense depth maps, and so are preferable when 3D shape recovery is of concern, but are computationally more expensive; the latter are robust and more efficient, but yield sparse disparity maps that must be further interpolated. We report experiments with the area-based algorithms. Random-dot stereo pairs, a widely adopted testbed in computational stereo, have been used.

It should be emphasised that the term "area-based" is given a slightly different meaning in human and in machine stereo. In the former domain (Section 3.2) it denotes the mapping of an input area onto an output disparity estimate; in machine stereo it indicates a class of techniques that estimate disparity, even of individual pixels, by means of neighbouring pixels (a window). Area-based algorithms, in the former sense, arise in machine stereo as special cases, for example when coarse-to-fine (multi-scale) techniques are adopted [44, 4, 11]. We also remark that for area-based algorithms to be applicable, the following constraints should be verified: (i) surfaces are textured; (ii) surfaces are lambertian; (iii) figural distortion is negligible. The basic Sum of Squared Differences (SSD) correlation algorithm is reported here.

Algorithm 3 Basic SSD

let $I_r$, $I_l$ be the right and left $N \times N$ images, respectively;
let $W$ be an $n \times n$ window (with $n \ll N$);
for each pixel $I_l(x, y)$
    for each disparity $d = (d_x, d_y)$ in some range
        compute $C(x, y, d) = \sum_{(\xi, \eta) \in W} [I_l(x + \xi, y + \eta) - I_r(x + \xi - d_x, y + \eta - d_y)]^2$
    end
    $d_l(x, y) \leftarrow \arg\min_d C(x, y, d)$
end
end

The estimated disparity is the one which minimises the euclidean distance (maximises the similarity) between the right and left areas. The asymptotic complexity is $O(n^2 N^2)$ for a fixed disparity range. The accuracy is at the pixel level, but subpixel precision can be achieved [8]. In Fig. 2 we show the results of applying the Basic SSD algorithm to a random-dot stereo pair. The Mean Absolute Error (MAE) is estimated as the mean of the absolute differences between computed and true disparities. Even under simplified conditions, it appears that the choice of the window size is critical (Fig. 2). A window that is too small is noise-sensitive, whereas an exceedingly large one acts as a low-pass filter, and is likely to miss depth discontinuities.
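Algorithm 3 and the MAE measure can be sketched as follows. This is a minimal illustration, not the code used for the experiments: it assumes horizontal disparities only, circular image borders, a uniform-disparity random texture instead of the square stereogram of Fig. 1, and the hypothetical names `ssd_disparity` and `mean_absolute_error`.

```python
import random

def ssd_disparity(left, right, half=1, max_d=5):
    """Dense horizontal disparity map by exhaustive SSD search.
    The window is (2*half+1) x (2*half+1); borders wrap around (an assumption)."""
    n_rows, n_cols = len(left), len(left[0])
    offsets = range(-half, half + 1)

    def cost(x, y, d):
        # C(x, y, d): squared differences between I_l(x+xi, y+eta) and I_r(x+xi-d, y+eta).
        return sum((left[(y + eta) % n_rows][(x + xi) % n_cols]
                    - right[(y + eta) % n_rows][(x + xi - d) % n_cols]) ** 2
                   for eta in offsets for xi in offsets)

    return [[min(range(max_d + 1), key=lambda d: cost(x, y, d))
             for x in range(n_cols)] for y in range(n_rows)]

def mean_absolute_error(computed, truth):
    """Mean of the absolute differences between computed and true disparities."""
    diffs = [abs(c - t) for row_c, row_t in zip(computed, truth)
             for c, t in zip(row_c, row_t)]
    return sum(diffs) / len(diffs)

random.seed(0)
size, true_d = 16, 3
left = [[random.random() for _ in range(size)] for _ in range(size)]
right = [[row[(x + true_d) % size] for x in range(size)] for row in left]
disparity = ssd_disparity(left, right)
mae = mean_absolute_error(disparity, [[true_d] * size for _ in range(size)])
```

With a single uniform disparity and no noise the minimum of the cost is exact everywhere; the errors reported in Fig. 2 arise at the disparity discontinuity and in the occluded regions, which this simplified test pattern does not contain.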

4.3 Some steps further: treating the occlusions

Naive SSD techniques can address neither depth discontinuities nor occlusions. The former problem is addressed effectively, although not efficiently, by the Adaptive Window algorithm [11], but the occlusions still require ad-hoc solutions.

There are two key observations for addressing the occlusion problem: (i) matching is not a symmetric process: when searching for corresponding elements, only the visible points in one image are matched; (ii) in many real cases a depth discontinuity corresponds to an occlusion in the other image. Some authors [4, 3] use observation (i) to validate the matching (left-right consistency); others [12, 2] use (ii) to constrain the search space. Recently, a new algorithm has been proposed [45] that computes disparity by exploiting the left-right consistency constraint. For each pixel, a correlation is performed with nine different $7 \times 7$ windows: the disparity with the smallest SSD error value is retained. Occlusions are also detected, by checking the left-right consistency and suppressing unfeasible matches accordingly. The algorithm is reported in the following:

Algorithm 4 Symmetry Based Stereo (SBS)

for all $(x, y)$ in the left image $I_l$ do
    for all windows $w$ do
        $d_{l,w}(x) \leftarrow \arg\min_d C(x, y, d, I_l, I_r, w)$
        $d_{r,w}(x) \leftarrow \arg\min_d C(x, y, d, I_r, I_l, w)$
    end
    $\sigma_d^2(x) = \frac{1}{N - 1} \sum_{w=1}^{N} (d_{l,w}(x) - \bar{d}_{l,w}(x))^2$
    $d_l(x) \leftarrow \arg\min_w C(x, y, d_{l,w}, I_l, I_r, w)$
    $d_r(x) \leftarrow \arg\min_w C(x, y, d_{r,w}, I_r, I_l, w)$
    $d(x) \leftarrow d_l(x) + \text{subpixel}$
end
for all $(x, y)$ in $I_l$ do
    if $d_l(x) \neq -d_r(x + d_l(x))$ then $\sigma_d^2(x) \leftarrow +\infty$
end
end

The SBS algorithm has been applied to the square random-dot stereograms of Fig. 1. Fig. 3 and Fig. 4 show the disparity maps computed by SBS and the estimated uncertainty maps (the darker the lower) in both cases. The estimated MAE is negligible and may be ascribed to the subpixel accuracy only. The occluded points, shown in white in the uncertainty maps, are recovered with 100% accuracy in both cases.
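The two ingredients of SBS, the choice among several windows and the left-right consistency check, can be sketched on a single scanline. This is a simplified illustration and not the implementation of [45]: it assumes three one-dimensional windows instead of nine $7 \times 7$ ones, circular indexing in the cost, integer disparities without subpixel refinement or uncertainty estimate, and hypothetical function names. The sign convention follows Algorithm 4: the match of sample $x$ lies at $x + d_l(x)$.

```python
import random

# Left-anchored, centred and right-anchored windows (an illustrative choice).
WINDOWS = [range(-4, 1), range(-2, 3), range(0, 5)]

def cost(a, b, x, d, window):
    """SSD between a[x + k] and b[x + k + d] for k in window (circular indexing)."""
    n = len(a)
    return sum((a[(x + k) % n] - b[(x + k + d) % n]) ** 2 for k in window)

def multi_window_disparity(a, b, candidates):
    """For each sample, keep the disparity of the window with the smallest SSD."""
    result = []
    for x in range(len(a)):
        best = min((cost(a, b, x, d, w), d) for w in WINDOWS for d in candidates)
        result.append(best[1])
    return result

def occluded(d_left, d_right):
    """Flag samples violating d_l(x) == -d_r(x + d_l(x)).
    Samples whose match falls outside the scanline are flagged as well."""
    flags = []
    for x, d in enumerate(d_left):
        x_r = x + d
        consistent = 0 <= x_r < len(d_right) and d == -d_right[x_r]
        flags.append(not consistent)
    return flags

random.seed(0)
n, shift = 32, 2
left = [random.random() for _ in range(n)]
right = [left[(x - shift) % n] for x in range(n)]  # so that right[x + shift] == left[x]
d_l = multi_window_disparity(left, right, range(0, 5))
d_r = multi_window_disparity(right, left, range(-4, 1))
flags = occluded(d_l, d_r)
```

In this example every sample receives the true disparity in both directions, and only the last two samples are flagged, because their matches fall beyond the end of the right scanline. Near a depth discontinuity, an off-centre window can lie entirely on one side of the discontinuity, which is the motivation for trying several windows.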

Figure 3: Computed disparity map (left) by SBS for the square random-dot stereogram and its uncertainty (right). MAE is 0.019.

Figure 4: Left: Isometric plot of the disparity map in Fig. 2 left. Right: Isometric plot of the disparity map in Fig. 3.

5 Conclusions

The computational techniques adopted in both human and machine stereo have been reviewed, as well as some of the results arising from the experimental observations. Further links emerge between the human and the artificial sides of stereo fusion. Evidence from psychophysics seems to point toward area-based mechanisms underlying both depth and motion perception in humans. On the other hand, area-based algorithms have been proposed in the domain of machine stereo to provide robust, dense depth maps for object and motion reconstruction. Although the term "area-based" takes a slightly different meaning in the two domains, it is still interesting to note that robust fusion mechanisms, involving an image point and its neighbourhood, are required in both cases.

Open problems in machine stereo concern the treatment of occlusions and depth discontinuities: the SBS algorithm takes some steps further, but the space scale (i.e., the window size) is still a free parameter. A coarse-to-fine strategy would add a flexible, scale-invariant mechanism to adapt the window size to the local image texture and disparity fields. Moreover, having segmented the occlusions, a relevant issue is how to exploit this valuable piece of information in the depth reconstruction process. Once more in the history of stereovision, a cross-disciplinary research effort appears promising.

References

[1] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 194:283-287, 1976.
[2] I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3):542-567, 1996.
[3] O. Faugeras, B. Hotz, H. Mathieu, T. Vieville, Z. Zhang, P. Fua, E. Theron, L. Moll, G. Berry, J. Vuillemin, P. Bertin, and C. Proy. Real-time correlation-based stereo: algorithm, implementation and applications. Technical Report 2013, Unité de Recherche INRIA Sophia-Antipolis, 1993.
[4] P. Fua. Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities. In Proceedings of the International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991.
[5] Y. Yeshurun and E. L. Schwartz. Cepstral filtering on a columnar image architecture: a fast algorithm for binocular stereo segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:759-767, 1989.
[6] K. O. Ludwig, H. Neumann, and B. Neumann. Local stereoscopic depth estimation. Image and Vision Computing, 12:16-35, 1994.
[7] M. J. Hannah. A system for digital stereo image matching. Photogrammetric Engineering and Remote Sensing, pages 1765-1770, 1989.
[8] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283-310, 1989.
[9] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353-363, 1993.
[10] S. S. Intille and A. F. Bobick. Disparity-space images and large occlusion stereo. In Jan-Olof Eklundh, editor, European Conference on Computer Vision, pages 179-186, Stockholm, Sweden, 1994. Springer-Verlag.
[11] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):920-932, 1994.
[12] D. Geiger, B. Ladendorf, and A. Yuille. Occlusions and binocular stereo. International Journal of Computer Vision, 14:211-226, 1995.
[13] H. K. Nishihara. PRISM, a practical real-time imaging stereo matcher. A.I. Memo 780, Massachusetts Institute of Technology, 1984.
[14] D. G. Jones and J. Malik. Computational framework for determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699-708, 1992.
[15] J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(8):806-825, 1992.
[16] D. Marr and T. Poggio. A theory of human stereo vision. A.I. Memo 451, Massachusetts Institute of Technology, November 1977.
[17] W. E. L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1):17-34, 1985.
[18] S. B. Pollard, J. E. W. Mayhew, and J. P. Frisby. PMF: A stereo correspondence algorithm using a disparity gradient constraint. Perception, 14:449-470, 1985.
[19] H. H. Baker and T. O. Binford. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[20] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139-154, 1985.
[21] G. Medioni and R. Nevatia. Segment-based stereo matching. Computer Vision, Graphics, and Image Processing, 31:2-18, 1985.
[22] S. T. Barnard and W. B. Thompson. Disparity analysis of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(4):333-340, 1980.
[23] M. R. M. Jenkin, A. D. Jepson, and J. K. Tsotsos. Techniques for disparity measurements. CVGIP: Image Understanding, 53(1):14-30, 1991.
[24] M. R. M. Jenkin and A. D. Jepson. Recovering local surface structure through local phase difference measurements. CVGIP: Image Understanding, 59(1):72-93, 1994.
[25] R. D. Henkel. Hierarchical calculation of 3d-structure. Technical Report 5/94, Zentrum für Kognitionswissenschaften, Universität Bremen, 1994.
[26] P. N. Belhumeur. A bayesian approach to binocular stereopsis. International Journal of Computer Vision, 19(3):237-260, 1996.
[27] C. Tomasi and R. Manduchi. Stereo without search. In European Conference on Computer Vision, pages 452-465, 1996.
[28] R. Blake and H. R. Wilson. Neural models of stereoscopic vision. Trends in Neuroscience, 14:445-452, 1991.
[29] R. C. Bolles, H. H. Baker, and M. J. Hannah. The JISCT stereo evaluation. Technical report, SRI International, January 1993.
[30] U. R. Dhond and J. K. Aggarwal. Structure from stereo - a review. IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489-1510, 1989.
[31] K. N. Ogle. Researches in Binocular Vision. Saunders, Philadelphia, 1950.
[32] K. N. Ogle. Disparity limits of stereopsis. Archives of Ophthalmology, 48:50-60, 1952.
[33] J. P. Frisby and J. E. W. Mayhew. Contrast sensitivity function for stereopsis. Perception, 7:423-429, 1978.
[34] D. L. Halpern and R. Blake. How contrast affects stereoacuity. Perception, 17:483-495, 1988.
[35] A. Arditi. Binocular vision. In K. R. Boff, L. Kaufman, and J. P. Thomas, editors, Handbook of perception and performance: Vol. 1., Sensory processes and perception, pages 23.1-23.41. Wiley, New York, 1986.
[36] G. Westheimer. The Ferrier lecture, 1992. Seeing depth with two eyes: stereopsis. Proc. R. Soc. Lond. B, 257:205-214, 1994.
[37] S. P. McKee, D. M. Levi, and S. F. Bowne. The imprecision of stereopsis. Vision Research, 30:1763-1779, 1990.
[38] G. Westheimer. Spatial interaction in the domain of disparity signals in human stereoscopic vision. Journal of Physiology, 370:619-629, 1986.
[39] C. W. Tyler. Stereoscopic vision: Cortical limitations and a disparity scaling effect. Science, 181:276-278, 1973.
[40] C. W. Tyler. Depth perception in disparity gratings. Nature, 251:140-142, 1974.
[41] C. W. Tyler and B. Julesz. On the depth of the cyclopean retina. Experimental Brain Research, 40:196-202, 1980.
[42] Y. Hermush and Y. Yeshurun. Spatial gradient limit on perception of multiple motion. Perception, 24:1247-1256, 1995.
[43] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. The MIT Press, Cambridge, 1993.
[44] G. Xu, S. Tsui, and M. Asada. Coarse-to-fine strategy for matching motion stereo pairs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 892-894, 1985.
[45] A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with multiple windowing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, S. Juan, Puerto Rico, June 1997. IEEE Computer Society Press. To appear.
