Automatic Face Recognition: What Representation?

Nicholas Costen

ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan Telephone: +81 (0)774 95 1044 Fax: +81 (0)774 95 1008 Email: [email protected]

Ian Craw 

Department of Mathematical Sciences, University of Aberdeen, Aberdeen AB9 2TY, Scotland. Telephone: +44 (0)122 427 2752 Fax: +44 (0)122 427 2607 Email: [email protected]

Shigeru Akamatsu

ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan Telephone: +81 (0)774 95 1030 Fax: +81 (0)774 95 1008 Email: [email protected]

CATEGORIES: Shape and Object Representation, Object Recognition.

Abstract

We describe a testbed used to investigate different codings for automatic face recognition. An eigenface coding of shape-free faces, using manually coded landmarks, was more effective than the corresponding coding of correctly shaped faces. This advantage for shape-free faces was reflected in, and depended upon, high-quality representation of the facial variation via an ensemble of shape-free faces. Configuration also proved an effective method of recognition, with rankings given to incorrect matches relatively uncorrelated with those from shape-free faces. The two sets of information combine to improve significantly the performance of either system alone. Manipulation within the coding to emphasize distinctive features of the faces, by caricaturing, allowed further increases in performance; this effect was noticeably larger when the independent shape-free and configuration coding was used. The addition of a system which directly correlated the contours of shape-free images also significantly increased recognition, suggesting extra information was still available. The coding of the configuration allowed automatic measurement of the face-shape using an active shape model. Taken together, these results strongly support the suggestion that faces should be considered as lying in a high-dimensional manifold which is linearly approximated by these two factors, possibly with a separate system for local features.

This work was supported in part by EPSRC Grant Numbers GR/H75923 and GR/J04951.



Summary Page

I. The major original contributions of this work are comprehensive testing of the shape-free and configuration representations appropriate for recognition of human faces, the comparison with a shape-free template-based approach, and the demonstration that these codings can be used to reproduce human face-recognition phenomena, in particular caricaturing.

II. It is important because it confirms the advantage of using "shape-free" faces, first suggested on theoretical grounds; in general, this implies taking account of the methods of variation of objects, coding them in terms of the objects, not in terms of images of the object. It also draws attention to caricatures in the context of machine vision.

III. The most closely related work is Lanitis, Taylor and Cootes (1994); however, this uses variation in a specific set of faces which are then recognized, and a number of examples of each such face are used in training. In contrast, this paper uses the general variation in a separate collection of faces; recognition is thus based on a single example.

IV. Other researchers can make use of this work both to improve recognition rates when distinguishing between configural objects, and in modelling human recognition.

V. This paper has also been submitted to the European Conference on Computer Vision. Preliminary work on testing, with early results, was presented at IWAFGR'95 in Zurich in June 1995.

1 Aims

In machine-based face recognition, a gallery of faces is first enrolled in the system and coded for subsequent searching. A probe face is then obtained and compared with each coded face in the gallery; recognition is noted when a suitable match occurs. The challenge for such a system is to perform recognition of the face despite transformations, such as changes in angle of presentation and lighting, common problems of machine vision, and changes also of expression and age, which are more special. The need is thus to find appropriate codings for a face which can be derived from (one or more) images of it, and to determine in what way, and how well, two such codings must match before the faces are declared the same. A number of face recognition systems which propose solutions to these problems have recently become available in the laboratory, and a natural concern has been the overall performance of each system (Turk and Pentland 1991, Edelman, Reisfeld and Yeshurun 1992, Lades, Vorbruggen, Buchmann, Lange, v. d. Malsburg, Wurtz and Konen 1993, Brunelli and Poggio 1993b, Pentland, Moghaddam and Starner 1994, Lanitis et al. 1994). Accordingly, test sets have been constructed and recognition accuracies computed. In practice, published recognition results are very good, but notoriously difficult to compare (Robertson and Craw 1994). Although the choice of coding and matching strategies differs significantly between systems, the greatest source of variability is probably the least relevant: the selection of the particular collection of faces on which to carry out tests, and in particular, the choice of transformation between target and probe over which the system is supposed to perform recognition. The FERET database may eventually provide a standard, but is currently only available for use within the USA. In this paper we seek to avoid some of these difficulties by fixing a matching strategy and a testing regime, and concentrating on the first of the problems just discussed: to find effective codes for recognition. Our concern is then no longer how well we can recognise; indeed for our purposes, a testing regime with a low recognition rate is of most interest: our interest instead is in comparing different coding strategies.

2 Principal Component Analysis

In this paper our main concern is to contrast simple image-based codings with eigenface codings, derived from Principal Component Analysis. Eigenface codings were used to demonstrate pattern completion in a net-based context (Kohonen, Oja and Lehtio 1981, Page 124), to represent faces economically (Kirby and Sirovich 1990), and explicitly for recognition (Turk and Pentland 1991). Much subsequent work has been based on eigenfaces, either directly, or after preprocessing (Craw and Cameron 1992, Shackleton and Welsh 1991, Pentland et al. 1994, Lanitis et al. 1994). While undoubtedly successful in some circumstances, the theoretical foundation for the use of eigenfaces is less clear.

Formally, Principal Component Analysis assumes that face images, usually normalised in some way, such as co-locating eyes to make them comparable, are usefully considered as (raster) vectors. A given set of such faces (called an ensemble here) is then analysed to find the "best" ordered basis for their span. Some psychological theories of face recognition have such a norm-based coding as their starting point; it is thus natural to consider such theories in more detail. It appears that an appropriate model may be a "face manifold" (Craw 1995), and the usual normalisation is then seen as a local linear approximation, or chart, for this manifold. Since a chart is a local diffeomorphism, and has its range in a linear space, the average of two sufficiently close normalised faces should also be a face. Clearly existing normalisation techniques approximate this property; it can be argued that, since Principal Component Analysis itself is a linear theory, this is precisely why they are useful.

Other more elaborate normalisation techniques can be identified, one of which was described by Ullman (1989), and has recently become prominent as the way in which a "morph" between two faces is performed. Landmarks are located on each face to provide a description of the face shape or configural information; there is a natural way to average landmark positions, and then to map an average face texture onto the resulting shape. More details of such a normalisation before Principal Component Analysis are given in Craw and Cameron (1992); we describe it as a decomposition into a shape vector, or configuration, and a shape-free, or texture, vector. A very similar decomposition, also to assist Principal Component Analysis coding, is given in Choi, Okazaki, Harashima and Takebe (1990). The main aim of our paper is to show that this more elaborate coding can produce significantly better recognition results. In Lanitis et al. (1994) results are presented whose motivation is very like our own. Configuration and texture are available separately, coded using Principal Component Analysis, and it is shown that the combination was more effective than either alone. However they choose different images of the same faces as ensemble, and as such address neither the more general coding issue, essential for larger collections of images, nor the problem of recognition from a single example.
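As a concrete illustration of the eigenface coding discussed above, the following minimal sketch uses the "snapshot" method of Kirby and Sirovich (1990), working with the small n × n Gram matrix rather than the full N × N cross-correlation matrix. It is our reconstruction, not the authors' implementation; the array layout and function names are our own assumptions.

    import numpy as np

    def eigenfaces(ensemble):
        # ensemble: (n, N) array, one raster-vector face per row.
        # The n x n Gram matrix shares its non-zero eigenvalues with the
        # N x N cross-correlation matrix, so for n << N it is far cheaper.
        n, N = ensemble.shape
        mean = ensemble.mean(axis=0)
        X = ensemble - mean                       # zero-mean data: n - 1 useful modes
        G = X @ X.T / (n - 1)                     # n x n Gram matrix
        evals, evecs = np.linalg.eigh(G)          # ascending eigenvalues
        order = np.argsort(evals)[::-1][:n - 1]   # keep the n - 1 non-trivial modes
        faces = X.T @ evecs[:, order]             # map back to pixel space: (N, n-1)
        faces /= np.linalg.norm(faces, axis=0)    # unit eigenfaces, column-wise
        return mean, evals[order], faces

    def code(face, mean, faces):
        # Orthonormality makes projection a single matrix product,
        # giving the (n-1)-tuple of components of a normalised face.
        return faces.T @ (face - mean)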

3 Methodology

Our methodology starts with face images on which a collection of landmarks have been located. Our first tests are with manual location; automatic location of landmarks, and the corresponding recognition results, are discussed in Section 5. Eigenfaces are computed from an ensemble of faces which have no further rôle in the process; the gallery and probe faces are then coded in terms of these. For a given probe, there is exactly one target, another image of the probe face, in the gallery, and our interest is in when the target best matches the probe.

Images: We work with images of size 128 × 128, writing N for the number of pixels in each image (so initially N = 16384) and n for the number of images in the ensemble; in our case, n = 50 or n = 100. A total of 14 images of each of 27 people provide our test material. Each image was acquired under fairly standardised conditions; we refer to them as Condition 1 through Condition 14. An initial set of 10 images of each person was acquired on a single occasion. Those in Condition 1 through Condition 4 were lit with good, flat, controlled lighting, with acquisition times a few seconds apart. Later conditions have increasingly severe lighting variations as well. In Fig. 1 we indicate something of the variability involved.

Figure 1: Conditions 1, 4, 5, 8 and 10.

A subsequent set of four images of each of these 27 faces was acquired between one and eight weeks afterwards: the first such, Condition 11, in lighting conditions similar to those obtaining for Condition 1; subsequent ones in increasingly different conditions. Condition 14 is the only image to be lit with a significant amount of natural, and so uncontrolled, light. In Fig. 2 we show the same subject as in Fig. 1 in each of these remaining conditions.

Figure 2: Conditions 11, 12, 13 and 14.

The 27 images in Condition 1 provide our gallery, which remains fixed throughout. The decision to eliminate condition variation in the gallery was a deliberate simplification. The remaining 13 images of each subject provide our probes; this gives 27 × 13, or 351, potential probes, each with a corresponding target in the gallery. Using each of the 27 faces as probe avoids the possibility that faces in the gallery not used as targets may be hard to recognise: we do this except when calculating acceptance parameters, when a gallery with no target is required; in that case we used a gallery of 26 faces. Rather than pool results over condition number, we keep the conditions distinct, expecting essentially perfect recognition from images in Conditions 2 and 3; those in Condition 14 provide a more varied test. An additional 50 faces were collected only in Condition 1 and are used as ensemble images, from which the eigenfaces are generated. Thus the ensemble, the gallery and the probes are mutually exclusive; there is no training set per se, and recognition is based on a single target. This approach differs from that employed when eigenfaces are used for representation, and reconstruction is done using only a few of the early eigenfaces. In our tests we use all the eigenfaces, although in Section 6 we discuss the effect of ignoring some. Examples of images in the ensemble are shown in Fig. 3.

Figure 3: Ensemble images before processing.

Certain tests were performed using an ensemble of non-facial images. The corresponding eigenimages are essentially the psychophysically useful two-dimensional derivative-of-Gaussian filters, which can be used for face-recognition (Rao and Ballard 1995). A collection of 50 scenes (resembling holiday snaps) were selected at random. Although the images here are typically more detailed than the faces, Hancock, Baddeley and Smith (1992) found that the nature of the principal components was not affected by the scale of such images. Thus the particular selection of images is relatively arbitrary.

Processing: Each image is processed in the same way, before being used in the ensemble, in the gallery or as a probe. A total of 34 landmarks, both true and deficient (e.g. the edge of the chin "half way" between two true landmarks), were found manually on each image, giving a triangulation, or face model, part of which can be seen in Fig. 5. A (uniformly) scaled Euclidean transformation of the image is derived to minimise the error between the actual positions and the corresponding points on a reference face, here the average of the ensemble faces, retaining the aspect ratio of the face. Such images are called normalised; this removes the effect of image variation associated with different camera locations and orientations, and is an alternative to positioning subjects carefully before the images are acquired. The background can then be identified and has no further rôle in the process; the remaining pixel values are adjusted so the resulting histogram is as flat as possible. Greater sensitivity is attained when our data have zero mean; to give this, the average image is calculated and subtracted from each member of the ensemble; in practice (n - 1) eigenfaces are then available. The complete normalisation processing is illustrated in Fig. 4, although for display purposes the ensemble average has not been subtracted.
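The scaled Euclidean normalisation above amounts to a least-squares similarity fit of one landmark set onto another. A minimal sketch, assuming landmarks stored as (k, 2) arrays in (x, y) order; the function name and the use of the standard orthogonal-Procrustes solution are our choices, not details taken from the paper.

    import numpy as np

    def similarity_align(src, ref):
        # Least-squares scaled Euclidean (similarity) transform taking the
        # landmarks `src` onto `ref`: returns (s, R, t) such that
        # s * src @ R.T + t approximates ref, preserving the aspect ratio.
        mu_s, mu_r = src.mean(axis=0), ref.mean(axis=0)
        A, B = src - mu_s, ref - mu_r
        U, S, Vt = np.linalg.svd(A.T @ B)           # orthogonal Procrustes
        d = np.sign(np.linalg.det(U @ Vt))          # guard against a reflection
        D = np.diag([1.0, d])
        R = (U @ D @ Vt).T                          # optimal rotation
        s = (S * [1.0, d]).sum() / (A ** 2).sum()   # uniform scale
        t = mu_r - s * mu_s @ R.T
        return s, R, t

Applying s * landmarks @ R.T + t to every landmark, with the corresponding resampling of the image, yields the normalised face.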

Figure 4: Processed gallery faces.

When a face image includes significant portions of the hair, the available featural information can often give good short-term recognition results. However, the hair is not invariant over the periods of months during which a practical system must maintain a useful recognition performance. To avoid this problem, we concentrate on a smaller part of the face whose appearance is more invariant; the available landmark data enables such an image, containing "inner features" only, to be extracted, as in Fig. 5. In the example given, the mask has been expanded slightly to display the facial locations and connecting triangulation better; in the results the border coincides with the outer black line. Essentially all the results we report are for such images, in which the hair has been excluded. These images have N = 2557 pixels; in contrast, the full face image of Fig. 4 has N = 5533.

In order to ensure comparability, our non-face images were processed in the same way as the face ensemble. There is no useful way that these images can be normalised; however, we randomly associated them with the 50 ensemble models. Since the point locations were essentially arbitrary, the resulting distortions caused by normalising these images did not change the relationships between the images, but ensured they were processed by the same programs.

Figure 5: Inner face showing the facial locations.

Coding and Matching: The resulting normalised ensemble is subjected to a Principal Component Analysis, in which eigenvalues and unit eigenvectors (or eigenfaces) of the image cross-correlation matrix are obtained. The orthonormality of the eigenfaces means it is simple to compute the component of any (normalised) face in the direction of each eigenface, and hence obtain an (n - 1)-tuple or code. A coded probe image is then compared with each gallery code to determine the best match. One way to do this uses nearest-neighbour matching in $R^{n-1}$, the span of the ensemble, and a natural choice of metric is the usual Euclidean distance. Since our basis of $R^{n-1}$ is orthonormal in $R^N$, this is just the usual Euclidean metric in $R^N$, and such recognition is effectively template matching. Another natural choice of metric on $R^{n-1}$ is the Mahalanobis distance, in which

$d(x, y)^2 = \sum_i \lambda_i^{-1} (x_i - y_i)^2$,

where $\{\lambda_i\}$ is the sequence of eigenvalues. This treats variations along all axes as equally significant, arguably appropriate since our aim is discrimination rather than representation.

A more robust scheme balances false acceptances with false rejections, and allows the possibility of no match being acceptable. One such (Lades et al. 1993) has a match score $c_j$ between each image in the gallery and the probe image. The best match corresponds to the lowest score, and interest centres on the sequence $\{c_j\}$, together with the lowest value $c_0$ and the next lowest value $c_1$. The mean $\mu$ and standard deviation $\sigma$ of the sequence obtained by removing the target image from the gallery are calculated and used to define two statistics, $r_1 = (c_1 - c_0)/\sigma$ and $r_2 = (\mu - c_0)/\sigma$, with associated thresholds $t_1$ and $t_2$; a match is accepted if $r_1 > t_1$ and $r_2 > t_2$, and otherwise rejected. We adopt this, reporting a correct match as a clear hit if the target passes this acceptance or separation criterion, and just a hit otherwise, with a similar terminology for misses. To set the thresholds, the distances between the probes in Condition 2 and the gallery images with the target deleted were found. The two statistics $r_1$ and $r_2$ were then calculated for each probe and the largest values independently chosen as $t_1$ and $t_2$. This procedure ensured that in the best, base, condition there was no false recognition; although it is particularly conservative, there are cases (e.g. Table 7) where "clear misses" do occur.
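A sketch of this matching and acceptance scheme, under our reading that both statistics are normalised by $\sigma$ (the source formulas are garbled on this point); the code assumes probe and gallery faces are already reduced to eigenface codes, and at test time uses the gallery with the best match removed as a stand-in for "the gallery without the target".

    import numpy as np

    def mahalanobis_d2(x, y, evals):
        # d(x, y)^2 = sum_i (x_i - y_i)^2 / lambda_i
        return float((((x - y) ** 2) / evals).sum())

    def match(probe_code, gallery_codes, evals, t1, t2):
        # Score the probe against every gallery code, then apply the
        # Lades-style separation criterion to label the best match as a
        # "clear" or merely "just" result.
        scores = np.array([mahalanobis_d2(probe_code, g, evals)
                           for g in gallery_codes])
        order = np.argsort(scores)
        c0, c1 = scores[order[0]], scores[order[1]]
        rest = scores[order[1:]]          # proxy: gallery minus the best match
        mu, sigma = rest.mean(), rest.std()
        r1, r2 = (c1 - c0) / sigma, (mu - c0) / sigma
        clear = (r1 > t1) and (r2 > t2)
        return int(order[0]), clear       # best gallery index, acceptance flag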

4 Results

We group Conditions 2, 3 and 4 together, and describe this as "Immediate" recognition. Conditions 5, 6 and 7 form a very similar set with a small change in lighting and position, and these are described as the "Variant" group. More fundamental lighting changes distinguish Conditions 8, 9 and 10, and these are combined as the "Lighting" group. Finally, the four conditions in which the images were acquired after a delay are grouped together as the "Later" set. To give a feel for overall performance, the four groups have been combined in an "Overall" group. Although more images are available in the sets with low condition numbers, greater interest attaches to the performance of the "Later" group, and accordingly we weight this latter group more heavily. The weights used are given in Table 1, together with the contribution that a single trial makes to the overall results.

            Trials  Weight  Individual
Immediate     81      1       0.11%
Variant       81      1       0.11%
Lighting      81      2       0.21%
Later        108      4       0.53%

Table 1: Weights used for overall performance.

Our main interest is in the comparison between scaled Euclidean normalisation and the more intrusive shape-free form, and the contrast between these two and a pure correlation approach. However we first discuss other choices which make up our testing regime.

Ensemble Size: Initial testing was done using the ensemble of 50 faces described above. It is not clear that an ensemble of 50 faces is adequate, but gathering more images, and the subsequent landmark location, was inconvenient. We thus borrowed an idea from Kirby and Sirovich (1990), making use of vertical symmetry, or rather the lack of it, in individual faces, by creating 50 "mirror" faces, whose images and landmarks were created by reflection about the vertical facial mid-line. The resulting improvement in recognition, shown in Table 2, was sufficiently noticeable to suggest that the original ensemble was indeed too small, and all subsequent tests are reported with this "doubled" ensemble. It is of interest that recognition also increased when we reflected the non-face ensemble in the same way.

              50   (50 × 2)/2   50 × 2
Immediate    95.1     96.3       98.8
Variant      86.4     82.8       87.7
Lighting     56.8     60.5       71.6
Later        57.4     64.8       71.3
Overall      64.4     69.2       76.1

Table 2: Hit percentages from 351 trials. Scaled Euclidean normalised, matching with Mahalanobis distance. Hair has been excluded from the match. Comparison between an ensemble with 50 faces, the first 50 eigenfaces from the "doubled" ensemble of 100 images (50 faces and their mirrors), and the full "doubled" ensemble.

Our "baseline" recognition in Table 3 gives results against which subsequent performance is to be compared, and was obtained using all 99 eigenfaces from this enlarged ensemble. Pixel value normalisation, as throughout, is by histogram equalisation. Other pixel value normalisation methods were tested. These included: just setting the length of the image vector to be constant (more useful when the average was not subtracted, which can in effect negate this preprocessing, but still giving a total of 15 fewer hits than histogram equalisation); using an edge image (see page 16); an H-transform (Watt 1994); and restricting to psychologically important spatial frequencies (Costen, Parker and Craw 1994). These last three gave significantly worse results.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    90.1    9.9    0.0    0.0
Variant:      67.9   22.2    9.9    0.0
Lighting:     17.3   48.1   34.6    0.0
Later:        34.3   32.4   33.3    0.0
Overall:      40.2   32.3   27.5    0.0

Table 3: Match percentages from 351 trials. Scaled Euclidean normalised, matching with Mahalanobis distance. Hair has been excluded from the match.
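The ensemble "doubling" is straightforward to sketch, assuming normalised images held as an (n, H, W) array with landmarks as (n, k, 2) (x, y) points. Reflecting about the image's vertical centre line is a stand-in for the facial mid-line of the paper; the two roughly coincide once the faces are normalised.

    import numpy as np

    def double_ensemble(images, landmarks):
        # Create "mirror" faces in the spirit of Kirby and Sirovich (1990):
        # reflect each image, and its landmarks, about the vertical centre.
        # Note: semantically labelled landmarks (left/right eye) would also
        # need their point ordering swapped; that step is omitted here.
        n, H, W = images.shape
        mirrored = images[:, :, ::-1]            # flip pixel columns
        lm = landmarks.copy()
        lm[..., 0] = (W - 1) - lm[..., 0]        # reflect x-coordinates to match
        return (np.concatenate([images, mirrored]),
                np.concatenate([landmarks, lm]))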

Mahalanobis or Euclidean? Our first comparison is between Table 3 and the same set of tests in which the match is based on Euclidean distance, given in Table 4. The use of the Mahalanobis distance is clearly more effective, confirming that the eigenface formulation, with its variance properties, is worthwhile here, and that we are not just using the orthogonality properties of the basis. Note also that the advantage is least evident in the "Immediate" group, where simple template matching is expected to perform well; but even in this case the effect on the separation of weighting the later components is noticeable. The comparison is very similar when the hair is included in the image area.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    82.7   14.8    2.5    0.0
Variant:      34.6   45.7   19.5    0.0
Lighting:      3.7   29.6   66.7    0.0
Later:        16.7   28.7   54.9    0.0
Overall:      22.9   29.2   47.9    0.0

Table 4: Match percentages from 351 trials. Scaled Euclidean normalised, matching with Euclidean distance. Hair has been excluded from the match.

PCA or just matching? Although this suggests an advantage for the Mahalanobis distance over the Euclidean distance, each of these uses only that information in the relevant images which is preserved when the images are projected onto the span of the ensemble. In this projection idiosyncratic information, which could aid recognition, may be lost. Thus a more appropriate baseline may be one using the whole of the relevant image information. To investigate this, a template-based recognition procedure was implemented using the whole of the (masked) face image. Matching was done on the basis of the best correlation between the probe and gallery, both normalised by a scaled Euclidean transformation derived from the landmarks. This yields the recognition performance given in Table 5.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    31.3    7.4    1.2    0.0
Variant:      55.6   35.8    8.6    0.0
Lighting:     11.1   42.0   46.9    0.0
Later:        23.1   40.7   36.1    0.0
Overall:      31.3   36.9   31.7    0.0

Table 5: Match percentages from 351 trials. Scaled Euclidean normalised, matching by correlation of the images. Hair has been excluded from the match.

This can be compared with either Table 4 or Table 3. It is clear that significant information is lost by moving to a facial norm (i.e. coding using just the vocabulary provided by the ensemble); this is balanced in part by an improved resistance to lighting changes.

Matching using the Mahalanobis distance more than makes up for this loss and we thus adopt this as our baseline in what follows.

Shape-free Normalisation: The theoretical considerations in Section 2 suggest that the decomposition of a face into a shape-free or texture vector, and the configuration or shape vector of landmark locations, may provide more effective coding for recognition. We discuss first the case in which only the shape-free face is used, deliberately ignoring configuration. Thus our normalisation, rather than using a scaled Euclidean transformation, texture-maps each face to a standard shape, in this case the average shape of the set of ensemble images. We used linear interpolation based on the model in Fig. 5; although simpler than Bookstein's thin-plate spline warps (Lanitis et al. 1994), we found the procedure more effective. As usual, the pixel values are then normalised by histogram equalisation. The results given in Table 6 are directly comparable to Table 3; only the treatment of the image shape is different.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    95.1    4.9    0.0    0.0
Variant:      64.2   29.6    6.2    0.0
Lighting:     18.0   51.9   29.6    0.0
Later:        28.7   46.5   25.0    0.0
Overall:      37.4   41.3   21.3    0.0

Table 6: Match percentages from 351 trials. Shape-free normalised, matching with Mahalanobis distance. Hair is excluded from the match.

The comparison suggests that shape-free normalisation is slightly better than the scaled Euclidean version, despite the fact that the shape information has been deliberately ignored. It may be that we have implemented the scaled Euclidean normalisation inappropriately, but other procedures, including one in which an appropriate ellipse was generated to mask the exterior features of the face and subsequently scaled to a fixed size, were tested and proved less effective.
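The texture-mapping step can be sketched as a piecewise-linear warp over a triangulation of the mean shape. The implementation below is ours, built on scipy's Delaunay triangulation rather than the paper's hand-built face model, and it uses nearest-neighbour rather than linear pixel interpolation for brevity.

    import numpy as np
    from scipy.spatial import Delaunay

    def shape_free(image, landmarks, mean_shape, out_shape):
        # Warp `image` so that `landmarks` (k, 2, x-y order) land on
        # `mean_shape`, giving a shape-free face of size out_shape = (H, W).
        tri = Delaunay(mean_shape)
        H, W = out_shape
        ys, xs = np.mgrid[0:H, 0:W]
        pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
        simplex = tri.find_simplex(pts)               # triangle per output pixel
        inside = simplex >= 0
        # Barycentric coordinates of each pixel in its mean-shape triangle
        T = tri.transform[simplex[inside]]            # (m, 3, 2) affine data
        b = np.einsum('mij,mj->mi', T[:, :2], pts[inside] - T[:, 2])
        bary = np.column_stack([b, 1.0 - b.sum(axis=1)])
        # The same barycentric mix of the source triangle corners gives the
        # pixel to pull back from the original image.
        corners = landmarks[tri.simplices[simplex[inside]]]
        src = np.einsum('mi,mij->mj', bary, corners)
        out = np.zeros(H * W)
        sx = np.clip(np.rint(src[:, 0]).astype(int), 0, image.shape[1] - 1)
        sy = np.clip(np.rint(src[:, 1]).astype(int), 0, image.shape[0] - 1)
        out[inside] = image[sy, sx]                   # nearest-neighbour sampling
        return out.reshape(H, W)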

Matching or representing? It is possible that the shape-free advantage, seen between Tables 3 and 6, reflects superior matching of the distorted images, rather than superior coding. These are confounded, since the coding and normalisation were performed together. A related issue is whether an advantage accrues from coding faces in terms of naturally occurring facial variation. Since our implied psychological model suggests that this coding occurs after the faces have been recognised as such, we would hope only to get an advantage for the shape-free faces when they are coded in terms of faces. This second issue was addressed using the non-face ensemble described in Section 3. This was processed in the same way as the face ensemble: any differences in coding effectiveness should arise because of the content of the ensemble rather than for more trivial image-based reasons. Coding and normalisation procedures were contrasted by using shape-free normalisation on the ensemble combined with a scaled Euclidean normalisation on the gallery and probes, together with the reverse. The results are shown in Fig. 6 for the face ensemble, and Fig. 7 for the non-face ensemble. While it is clear that for the face ensemble the determining factor is the ensemble standardisation method, the non-face ensemble only shows an effect of the probe standardisation method. In both cases there is evidence of an interaction between the two factors, reflecting slight differences in the pixel grey-level interpolation algorithms for transforming the images. These results suggest that the advantage for shape-free faces reflects superior representation of the faces, not just better matching. They also suggest that the (noticeably worse) recognition seen when the non-faces are used for an ensemble should be thought of as picture-based, rather than face-based.

Figure 6: Recognition using an ensemble of faces with different standardisation methods for ensemble and for gallery and probes: hit rates for Euclidean-normalised and shape-free images.

Figure 7: Recognition using a non-face ensemble with different standardisation methods for ensemble and for gallery and probes: hit rates for Euclidean-normalised and shape-free images.

Spatial frequencies: an alternative coding? Further evidence of the source of the advantage for shape-free faces may come from a direct investigation of the eigenfaces. One useful technique is Fourier analysis, since this provides an orthogonal decomposition, comparable with the orthonormal eigenface basis in providing a holistic coding of the input, but differing in that it codes the image in terms of general, image-based characteristics, rather than characteristics based on appropriate variances, in our case those of an ensemble of faces. When faces are reconstructed using only a restricted band of frequencies, human recognition is found to depend almost entirely upon a band between 3 and 20 cycles per face width (Costen et al. 1994). A comparable reconstruction of (scaled) Euclidean normalised faces, using a limited band of eigenfaces, results in images which appear to be primarily low-frequency, shape-variable ones when the pass-band concentrates on early eigenfaces, and high-frequency, texture-variable ones when the pass-band concentrates on later eigenfaces (O'Toole, Deffenbacher, Valentin and Abdi 1994). This is perhaps to be expected, since both lower frequencies and earlier-extracted eigenfaces will account for greater amounts of variance than later ones.

Examples of frequency spectra of eigenfaces are shown in Fig. 8. These use a shape-free ensemble with the average included. Since Fourier analysis is done most easily on square images, average intensities were substituted outside the area of our normal mask. This ensured the image and facial co-ordinates were consistent. Although the first component (the average face) has the approximately logarithmic decline expected of real images, the other eigenfaces have band-passed profiles, with noticeable amounts of noise. This noise means that a relationship between spatial frequency and eigenface rank is not directly apparent; we now show that one is revealed by using an appropriate averaging procedure. For each eigenface, the spatial frequencies were ranked by energy, and the wavelengths of the most energetic frequencies averaged. These wavelengths were then correlated with eigenface rank. Increasing the number of components used to calculate the "mean energetic wavelength" reduces the effect of the noise: this allows investigation of the frequency selectivity of the components; highly selective components should have high correlations which quickly reduce as the number of components averaged increases.


Figure 8: Examples of the spatial frequency spectra of the Principal Components of shape-free inner faces.


Figure 9: Variation in the relationship between eigenface rank and spatial frequency energy for shape-free, Euclidean and non-face Principal Components.

Figure 9 shows the results of this manipulation using three ensembles, with shape-free, Euclidean normalised and non-face images. The number of spatial frequencies averaged was varied between 1 and 90 and the Spearman rank correlations taken. All three correlations peak at a high, positive value with 4 to 8 frequency components averaged, but then the behaviour of the three diverges. It appears that while non-faces are highly selective to spatial frequencies, the face-based components are less so, and that selectivity decreases as knowledge of the shape of the face increases. This in turn suggests that the relationship between spatial frequency and principal component rank previously noted is an artifact, and that the shape-free advantage rests upon coding the faces in terms of their facial variation, which may change in size, rather than their image qualities.
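The "mean energetic wavelength" computation can be sketched as follows; the FFT-based energy ranking and the function names are our reconstruction of the procedure described above, assuming square eigenface images.

    import numpy as np
    from scipy.stats import spearmanr

    def mean_energetic_wavelength(eigenface_img, n_freqs):
        # Average wavelength (in pixels) of the n_freqs most energetic
        # spatial frequencies of one square eigenface image.
        F = np.fft.fft2(eigenface_img)
        energy = np.abs(F) ** 2
        fy = np.fft.fftfreq(eigenface_img.shape[0])
        fx = np.fft.fftfreq(eigenface_img.shape[1])
        radius = np.hypot(*np.meshgrid(fx, fy))   # cycles per pixel
        energy[0, 0] = 0.0                        # ignore the DC term
        radius[0, 0] = 1.0                        # avoid division by zero
        top = np.argsort(energy.ravel())[::-1][:n_freqs]
        return float((1.0 / radius.ravel()[top]).mean())

    def rank_wavelength_correlation(eigenface_imgs, n_freqs):
        # Spearman correlation between eigenface rank and mean energetic
        # wavelength, the quantity varied (n_freqs = 1..90) for Fig. 9.
        w = [mean_energetic_wavelength(e, n_freqs) for e in eigenface_imgs]
        rho, _ = spearmanr(np.arange(len(w)), w)
        return rho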

Shape Data: Since the shape-free normalisation discards information on the shape of the face, recognition may be enhanced by independent consideration of the shape, performing Principal Component Analysis on the landmark locations. This was done as already described, first applying a scaled Euclidean transformation to remove accidental position effects, and then, if necessary, removing the points relating to the hair. The shapes of the ensemble images then provided suitable principal components (we reserve "eigenface" for texture components) as descriptors. There are a maximum of 46 degrees of freedom in the shape data derived from the 34 landmarks, but it is to be expected that these data are highly correlated. This was borne out during the Principal Components Analysis, when the eigenvalues became small after the first 15 or 20 eigenshapes had been derived. The number of principal components used to code the shape vector was thus varied, and the associated hit rates are shown in Fig. 10 for the doubled ensemble with and without hair. These show that in both cases recognition peaks when 20 components are included in the analysis; the with-hair images are better than the no-hair images with more than 20 components. The summary statistics for the doubled ensemble with 20 components used to code the configuration are shown in Table 7. It may be that some of the recognition is a result of variation in camera optics, which were unfortunately not controlled between subjects. However, the relatively good "Later" performance argues against this.
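A sketch of the shape coding: PCA on flattened landmark vectors, keeping the 20 most variable components. The array shapes and names are our own choices.

    import numpy as np

    def shape_components(shapes, n_keep=20):
        # PCA of aligned landmark configurations. `shapes` is (n, k, 2);
        # each is flattened to a 2k shape vector. Returns the mean shape
        # vector and the n_keep most variable "eigenshapes" with variances.
        n = shapes.shape[0]
        X = shapes.reshape(n, -1)
        mean = X.mean(axis=0)
        Xc = X - mean
        cov = Xc.T @ Xc / (n - 1)                 # 2k x 2k: small enough here
        evals, evecs = np.linalg.eigh(cov)
        order = np.argsort(evals)[::-1][:n_keep]
        return mean, evals[order], evecs[:, order]

    def shape_code(shape, mean, evecs):
        # Project one landmark set onto the retained eigenshapes.
        return evecs.T @ (shape.ravel() - mean)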

Figure 10: Variation in correct classification from shape with available principal components.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    39.5   46.9   13.6    0.0
Variant:      23.5   58.0   17.3    1.2
Lighting:     27.2   54.3   18.5    0.0
Later:        19.4   59.3   21.3    0.0
Overall:      23.7   56.7   19.4    0.1

Table 7: Match percentages from 351 trials. Shape only, matching with Mahalanobis distance on the 20 eigenvectors with the largest variance. Hair has been excluded from the match.

Instance-based codes? The above tests show that there is a real advantage in representing faces in terms of shape-free texture. We are thus led to ask if it is also an advantage for instance-based recognition, in which matches are performed on the images themselves, and thus to seek the analogue for shape-free faces of Table 5. Because our normalisation methods use a relatively small number of points, the quality of the match between images may be underestimated; to compensate, the correlation between a probe and each gallery image was optimized separately by choosing the scaled Euclidean transform (assumed already very close to the identity) which maximized the image correlation. We found a significant effect of the type of image-processing used; there was a very noticeable advantage for preprocessing with a laplacian transformation, a 3 × 3 matrix with a positive centre and zero corners, often thought of as a sharpening operator. The complete recognition rates for the shape-free laplacian images are given in Table 8, showing very good and constant recognition. It should however be noted that this is very slow, even with the relatively small gallery used here; optimizing the match meant that each image was compared with each gallery member 50 times. In contrast, the laplacian preprocessing is not useful for Principal Components Analysis; with it, the shape-free transformation yields a correct hit-rate overall of 26.7%.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    85.2   14.8    0.0    0.0
Variant:      67.9   30.9    1.2    0.0
Lighting:     38.3   50.6   11.1    0.0
Later:        28.7   62.0    9.3    0.0
Overall:      41.0   51.2    7.8    0.0

Table 8: Match percentages from 351 trials. Shape-free normalised, matching by full correlation of laplacian images. Hair is excluded from the match.

These results again show the advantage of using a shape-free representation; in this case it ensures that all sections of the laplacian-processed images can be aligned at once. In contrast, when there is still shape left in the images, different sections of the probe face compete to match corresponding sections of the gallery images, and the ratio between the distances to the target and the distractors is thus reduced. The approach is, however, dependent on appropriate preprocessing to allow this matching; the slight drop in recognition when histogram equalisation, our "standard" method, is used instead should be expected, given that the shape-free advantage for Principal Components Analysis appears to reflect better representation, rather than matching. Certainly the shape-free manipulation does remove information; these results serve to emphasize again that notably different operations are occurring in Principal Components Analysis and in correlation.
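The laplacian preprocessing and correlation match can be sketched as below. The exact kernel weights are not given in the paper, so the classic four-neighbour laplacian (positive centre, zero corners) is an assumption, and the search over near-identity scaled Euclidean transforms is omitted.

    import numpy as np
    from scipy.ndimage import convolve

    # Four-neighbour laplacian: positive centre, zero corners (assumed weights).
    LAPLACIAN = np.array([[ 0., -1.,  0.],
                          [-1.,  4., -1.],
                          [ 0., -1.,  0.]])

    def laplacian_correlation(probe, gallery_img):
        # Normalised correlation of laplacian-filtered shape-free images.
        a = convolve(probe, LAPLACIAN, mode='nearest').ravel()
        b = convolve(gallery_img, LAPLACIAN, mode='nearest').ravel()
        a, b = a - a.mean(), b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))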

Shape and Texture: Coding using shape-free faces, and coding using just the face shape, each give reasonable recognition. If these two measures are relatively independent, an appropriate combination may be more effective than either. This was investigated by applying Principal Component Analysis separately to the shape and the shape-free images, using the 20 most variable shape components, but all the texture eigenfaces derived from histogram-equalised images. Independence was assessed by measuring correlations of the ranks of the distances between each probe and the other images in the gallery (this helped to avoid outlier effects). The average Spearman rank correlations are given in Table 9 and show that the correlations are positive but modest. This suggests that shape and texture describe dissimilar properties; the positive correlation may reflect a tendency for faces to be extreme in both measures.

Immediate   Variant   Lighting   Later
  0.267      0.195      0.201    0.257

Table 9: Correlation of match rankings based on Mahalanobis distance for shape and texture from 351 trials. Hair has been excluded from the match.

The shape and texture distances for each probe were combined using a root mean square, having first rescaled the individual distances so the sum of each set was unity. The results in Table 10 are thus comparable with our baseline Table 3, but combine locally linearised shape with (shape-free) texture information.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    90.1    9.9    0.0    0.0
Variant:      71.6   25.9    2.5    0.0
Lighting:     40.7   50.6    8.6    0.0
Later:        42.6   46.3   11.1    0.0
Overall:      50.4   41.1    8.5    0.0

Table 10: Match percentages from 351 trials. Combined shape and texture measures, matching with Mahalanobis distance. Hair has been excluded from the match.

Obviously, the combination of incompatible types of information is somewhat arbitrary. Two other methods were tried: multiplying the pairs of distances, and excluding the worse half of the shape matches before matching on texture; each gave slightly worse performance. The number and range of texture eigenfaces included in the coding was also varied, but this showed no clear pattern; variations in hit-rates were 1-2%.
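The root-mean-square combination is simple enough to state exactly; a sketch, assuming one distance per gallery face from each system:

    import numpy as np

    def combine_distances(shape_d, texture_d):
        # Rescale each set of distances to sum to one, then combine by
        # root mean square; the lowest combined value is the best match.
        s = np.asarray(shape_d, float)
        t = np.asarray(texture_d, float)
        s, t = s / s.sum(), t / t.sum()
        return np.sqrt((s ** 2 + t ** 2) / 2.0)

    # best = combine_distances(shape_d, texture_d).argmin()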

Caricaturing: The availability of distinct shape and texture components of the face allows other manipulations to enhance recognition. One such, first demonstrated in the psychological arena, is caricaturing. We code face shape as a set of position vectors, each of which is the displacement of a landmark from the location of the corresponding landmark in the average face. Scaling the set of displacements uniformly by an amount k gives a caricatured shape, with k = 100% representing the veridical; a caricatured face is then created by texture-mapping the face image to this shape. When shown to humans, familiar faces are recognised better with modest caricatures (less than about 150%), both for line-drawings (Rhodes, Brennan and Carey 1987) and for warped grey-scale images (Benson and Perrett 1994). Image texture can be caricatured similarly, by displacing the grey levels in a shape-free face away from the mean grey-level for that pixel. An example in which both the shape and texture of the image have been caricatured is shown in Fig. 11.
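Both manipulations are linear extrapolations away from the norm, as in this sketch (names ours); the caricatured image itself would then be produced by texture-mapping the caricatured texture to the caricatured shape, e.g. with the warp sketched earlier.

    import numpy as np

    def caricature(landmarks, texture, mean_shape, mean_texture, k):
        # Scale deviations from the norm by k: k = 1.0 is veridical,
        # k = 1.56 a 156% caricature, k < 1.0 an anti-caricature.
        shape_k = mean_shape + k * (landmarks - mean_shape)
        texture_k = mean_texture + k * (texture - mean_texture)
        return shape_k, texture_k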

Figure 11: Effects of caricaturing an image at 41, 64, 100, 156 and 244 percent.

Certain techniques automatically extract caricatures; Brunelli and Poggio (1993a) show that modest caricatures are extracted by a Radial Basis Function network operating on feature-distances. Typically the representations were about 110% caricatures, as RBFs extract the most distinctive set of features. A Principal Components Analysis technique, with its veridical representation, has greater freedom, as it allows investigation of the coding giving the most effective caricatures.

Our first set of recognition tests seeks to correspond to tests done on humans. We caricatured the shape of the faces by different amounts while leaving the texture intact, and recognised via both shape and texture, using the whole face and deriving the thresholds t1 and t2 for clear hits from the veridical images. This reproduces the situation in Benson and Perrett (1994) and, as Fig. 12 shows, gives a rather high peak at about 156%. There are a number of possible, but trivial, explanations of why the effect may be stronger here than in humans. The second test, Fig. 13, shows that under appropriate conditions the effect is constant at higher levels of caricaturing. The texture of the shape-free processed images was caricatured against the average of the processed ensemble. Since this is the norm of the Principal Components Analysis, the angle of the vectors which describe the faces should not alter. The caricaturing effect persists with little change to at least 900%.


Figure 12: Recognition of shape-and-texture faces with shape caricatured. Hair has been included in the match.


Figure 13: Recognition of shape-free faces caricatured against the processed average. Hair has been excluded from the match.

The third and fourth tests show the effects of recognizing images caricatured on shape and texture with a scaled Euclidean normalised Principal Components Analysis. This yields the notably smaller caricature effect shown by Fig. 14, while independent shape and shape-free Principal Components Analyses, as shown in Fig. 15, give a strong caricature effect with recognition peaking at about 150%. The peak recognition rates are shown in Table 11. The images used for recognition in both these curves have both shape and texture independently caricatured against a different average face from that of the ensemble used for recognition. This ensures that the difference in results is not just due to the difference between the caricature and coding averages in the Euclidean case. It may slightly reduce recognition in the independent shape and texture case.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    95.1    4.9    0.0    0.0
Variant:      80.2   18.5    1.2    0.0
Lighting:     48.1   46.9    4.9    0.0
Later:        58.3   34.3    7.4    0.0
Overall:      62.4   32.1    5.4    0.0

Table 11: Match percentages from 351 trials. Shape and texture, matching with Mahalanobis distance with the images caricatured to 156% of their original value. Hair has been excluded from the match.


Figure 14: Confident and total hit rates for shape-and-texture caricatured faces, recognised by a Euclidean-normalised Principal Components Analysis. Hair has been excluded from the match.


Figure 15: Confident and total hit rates for shape-and-texture caricatured faces, recognised by separate shape and texture Principal Components Analyses. Hair has been excluded from the match.

This difference in the size of the caricature effect is only seen if the images are caricatured against the independent shape and texture averages. Caricaturing images against the average of the Principal Components Analysis, regardless of the type of normalisation used, gives approximately equal effects. The specific advantage of the shape-free manipulation is that it allows equivalent transformations both in image-space (as evidenced by the human data) and in the linearisation where the Principal Components Analysis is performed.

Shape, texture and shape-free correlation: We conclude the section with a final result in which all three matching methods, shape, texture and shape-free correlation, are combined. Thus we recognise by combining (again using a root mean square) the evidence which led to Tables 6, 7 and 8. We work with images with the hair portion removed, and combine PCA-based representations derived separately from shape and shape-free texture, each matched using Mahalanobis distance, with an optimized correlation between shape-free images pre-processed with a laplacian filter. The results, in Table 12, suggest there remains relevant information which we have been unable to code using caricature techniques; but we again emphasize that the optimized correlation takes impractically long, and does not scale well to larger gallery sizes as the PCA-based methods do.

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    97.5    2.5    0.0    0.0
Variant:      88.9   11.1    0.0    0.0
Lighting:     67.9   29.6    2.5    0.0
Later:        66.7   32.4    0.9    0.0
Overall:      72.6   26.3    1.1    0.0

Table 12: Match percentages from 351 trials. Shape and texture, matching with Mahalanobis distance, combined with laplacian pre-processed optimized correlation. Hair has been excluded from the match.

5 Facial shape-finding

Facial location and orientation: If our proposed coding process is to be performed automatically, we need a way of locating the required landmarks. Ideally this would be done directly, but some landmarks, chosen to assist the passage to a shape-free face, are elusive. In practice, all that is needed initially is the accurate location of enough landmarks to provide a scaled Euclidean normalisation of a new face; we then generate refined shape estimates sequentially. These initial landmarks are provided by a development of an earlier program, FindFace (Craw, Tock and Bennett 1992), which we describe briefly.

As a first step, a collection of "contexts", each essentially a choice of scaled Euclidean transformation from the given image to a reference image, is derived; these then allow the positions of other features to be predicted from statistical information. The contexts are generated by a number of simple feature finders, including eye, outline and nose finders. At this stage the aim is to ensure that the correct context is one of those selected. The contexts are then filtered by seeking additional features, applying known geometrical and statistical constraints, and finally accepting the most successful context. Although more features are involved in this filtering stage, only the resulting locations of the eyes and mouth are reported. Testing such a system is notoriously difficult, but at present it appears to make few errors, failing on only two images of the 129 now available in Condition 1, and to degrade gracefully on more difficult images.

Configuration determination: The contextual information is already enough to initialise a bootstrapping procedure to locate the remainder of the landmarks, assuming that the ensemble, and as such an average shape-free face, is available. Each set of landmark locations on a face defines a corresponding shape-free face: our method of landmark location is in principle simple; we choose those landmark locations on our new face for which the corresponding shape-free face has the highest image correlation with the average shape-free face. In other words, we choose landmarks so the resulting shape-free face will be most "face-like". However, to do the required optimisation efficiently needs the Principal Components Analysis of face shape discussed above, and in particular the consequent orthogonal decomposition of the shape. By fitting these orthogonal components successively we have a very effective means of navigating through the vast space of possible patterns; it enforces a tree-structure where the nature of the later decisions is not affected by the earlier ones.

We rely on FindFace for initial locations of the eyes of the probe and gallery images, and then perform position normalisation by co-locating eyes. Thus, for consistency, a Principal Components Analysis of the ensemble shape was carried out based on this normalisation. Starting with the average model, new models were built by varying the shape according to the first (shape) Principal Component over a range of up to two standard deviations, in essence applying an active shape model as proposed by Cootes, Taylor, Cooper and Graham (1995). The resulting model, representing initial locations of all the landmarks, was used to distort the probe to the average shape. The normalised probe was then correlated against the average of the shape-free textures. Both images were histogram-equalised, but no average was removed. This correlation was used as a measure of the appropriateness of the model, and a simple hill-climbing algorithm implemented to get the best fit: the model fitted on the first component was used as the starting point from which to fit the second coefficient, and so on for the 20 most variable eigenshapes. In a similar context, Lanitis, Taylor, Cootes and Ahmed (1995) fitted the first 16 components. The fitting was performed upon the whole, masked, face, including the hair; this gave the most accurate and consistent point-definitions. The other obvious convergence test, using as error measure the distance from the span of the (shape-free) ensemble, yielded locations idiosyncratic to the face, while only manipulating the inner face tended to give models with inner points inside the actual contour. Providing meaningful information about the accuracy of such locations is difficult; even a comparison with manually located landmarks is not necessarily useful, since the need is to achieve repeatability between gallery and probes, rather than agreement with a "correct" position. However, an example of the output of this procedure is given in Fig. 5. Apart from the right eye, the only points which might cause difficulty are those associated with the ears, which are badly defined in this model. When the points found in this manner were used as the input to the complete system, including a caricature of 156%, it provided the recognition values shown in Table 13. There was a significant caricature effect; the recognition rate for veridical images was 65.7%, with 34.7% clear hits.
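A sketch of this sequential fit; a coarse grid search stands in for the paper's hill-climbing, and the correlate callback (which should warp the image to the mean shape under the trial landmarks and correlate it with the average shape-free texture, e.g. using the warp sketched in Section 4) is left to the caller. All names, and the nine-step grid, are our assumptions.

    import numpy as np

    def fit_shape(image, mean_shape, eigenshapes, sds, correlate, n_steps=9):
        # mean_shape: (k, 2); eigenshapes: (2k, m) columns; sds: (m,)
        # per-component standard deviations. Components are fitted in
        # turn, each within +/- 2 s.d., keeping earlier fits fixed.
        coeffs = np.zeros(eigenshapes.shape[1])
        for i in range(eigenshapes.shape[1]):
            best_b, best_c = 0.0, -np.inf
            for b in np.linspace(-2.0, 2.0, n_steps):
                trial = coeffs.copy()
                trial[i] = b * sds[i]
                shape = (mean_shape.ravel() + eigenshapes @ trial).reshape(-1, 2)
                c = correlate(image, shape)     # "face-likeness" of the warp
                if c > best_c:
                    best_b, best_c = b, c
            coeffs[i] = best_b * sds[i]
        return (mean_shape.ravel() + eigenshapes @ coeffs).reshape(-1, 2)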

                  Hit            Miss
             Clear   Just   Just   Clear
Immediate:    86.4   12.3    1.2    0.0
Variant:      67.9   24.7    7.4    0.0
Lighting:     37.0   34.6   24.2    1.2
Later:        30.6   37.0   32.4    0.0
Overall:      41.9   32.5   25.3    0.3

Table 13: Match percentages from 351 trials. Shape and texture, matching with Mahalanobis distance with the images caricatured to 156% of their original value. Hair has been excluded from the match and the feature locations determined automatically.

This procedure is slow, and more local ways of using the available shape information, which may increase location accuracy, are currently being investigated. It should be noted that the significant caricature effect suggests that the locations are consistent across different images of the same face. Nevertheless the method we have described is of interest: its simplicity and effectiveness in shape bootstrapping provide a natural procedure on very parallel machines, in which the number of correlations and comparisons is not an obstacle, and thus a possible biological model.

6 Conclusions

We have attempted to show that a greater consideration of the nature of Principal Component Analysis yields advantages in recognition. Doing so, moving from scaled Euclidean normalised images to the combined configuration and texture images, produces a three-fold reduction in misses without adding extra information. Rather, the configural information, which appears to overshadow the texture in the scaled Euclidean normalised images by requiring that facial features be approximated by a combination of different features from the different eigenfaces, has been treated separately so that it can have a positive effect. The corresponding number of clear hits has increased, but not to the same degree. Caricaturing the images can improve this, by distorting them to emphasize their already atypical aspects. This may not change the ordering of matches, but does increase the separation. Notably, this advantage for shape-free Principal Components remains even if the probe is not itself shape-free, reinforcing the conclusion that this is a representational advance. Conversely, if the image has not been identified as a face (or potentially a member of some other class of objects sharing a configuration), such a manipulation is not useful. The processing, and thus the representation, is dependent upon the task at hand, changing from a general-purpose, spatial-frequency selective coding to a special-purpose, non-frequency-selective coding as knowledge of facial shape increases. This observation is reinforced by the finding that, unlike the scaled Euclidean normalisation, the shape and shape-free normalisation allows equivalent transformations in a human face-space and in that provided by the Principal Components Analysis. This decomposition of the face, first into configuration and texture, and then into Principal Components, also allows the efficient location of facial features. Although the locations are not currently as accurate as those found by hand, methods of increasing the accuracy are being investigated.

The clear advantage for Mahalanobis distance over Euclidean distance, consistent across conditions, provides evidence that Principal Component Analysis is a more appropriate method of coding faces than simply using raw images, and that something more sophisticated than simple template matching is occurring. Since the Mahalanobis distance aims to pay equal attention to all components, we expect no particular band of eigenfaces to best code the images; once variability is taken into account, the eigenfaces should all have the same importance. Within reasonable limits, this was found; for this reason we have used all the eigenfaces in the tests described here.

Overall we believe we have shown that Principal Component Analysis, implemented under the influence of a manifold model of "face space", separating configural and textural information, has proved of value in coding for recognition; this could be of relevance when constructing psychological models of face recognition. We are not advocating it as a universal code; the observation that very high levels of recognition can be obtained by shape-free contour matching, and the increase in recognition when this is combined with the shape-and-texture output, demonstrate that not all the facial information has been captured. This suggests that the psychological implications of this work are late in the processing chain, when the face is being considered as a whole. One possible model has independent shape and texture, derived from our local chart, selecting a small group of possible matches, with the ultimate recognition decision being made by contour correlation.

References

Benson, P. J. and Perrett, D. I.: 1994, Visual processing of facial distinctiveness, Perception 23, 75-93.

Brunelli, R. and Poggio, T.: 1993a, Caricatural effects in automated face perception, Biological Cybernetics 69, 235-241.

Brunelli, R. and Poggio, T.: 1993b, Face recognition: features versus templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 1042-1052.

Choi, C. S., Okazaki, T., Harashima, H. and Takebe, T.: 1990, Basis generation and description of facial images using principal component analysis, Technical Report of IPSJ: Graphics & CAD 46, 43-50. In Japanese.

Cootes, T. F., Taylor, C. J., Cooper, D. H. and Graham, J.: 1995, Active shape models: their training and application, Computer Vision and Image Understanding 61, 38-59.

Costen, N. P., Parker, D. M. and Craw, I.: 1994, Spatial content and spatial quantisation effects in face recognition, Perception 23, 129-146.

Craw, I. and Cameron, P.: 1992, Face recognition by computer, British Machine Vision Conference 1992, pp. 498-507.

Craw, I. G.: 1995, A manifold model of face and object recognition, in T. R. Valentine (ed.), Cognitive and Computational Aspects of Face Recognition, Routledge, London, chapter 9, pp. 183-203.

Craw, I., Tock, D. and Bennett, A.: 1992, Finding face features, Proceedings of ECCV-92, pp. 92-96.

Edelman, S., Reisfeld, D. and Yeshurun, Y.: 1992, Learning to recognise faces from examples, Proceedings of ECCV-92, pp. 787-791.

Hancock, P. J. B., Baddeley, R. J. and Smith, L. S.: 1992, Principal components of natural images, Network 3, 61-70.

Kirby, M. and Sirovich, L.: 1990, Application of the Karhunen-Loève procedure for the characterisation of human faces, IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 103-108.

Kohonen, T., Oja, E. and Lehtio, P.: 1981, Storage and processing of information in distributed associative memory systems, in G. Hinton and J. Anderson (eds), Parallel Models of Associative Memory, Erlbaum, Hillsdale, N.J., chapter 4.

Lades, M., Vorbruggen, J. C., Buchmann, J., Lange, J., v. d. Malsburg, C., Wurtz, R. P. and Konen, W.: 1993, Distortion invariant object recognition in the dynamic link architecture, IEEE Transactions on Computers 42, 300-311.

Lanitis, A., Taylor, C. J. and Cootes, T. F.: 1994, An automatic face identification system using flexible appearance models, British Machine Vision Conference 1994, pp. 65-74.

Lanitis, A., Taylor, C. J., Cootes, T. F. and Ahmed, T.: 1995, Automatic interpretation of human faces and hand gestures using flexible models, International Workshop on Automatic Face- and Gesture-Recognition, MultiMedia Laboratory, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, pp. 98-103.

O'Toole, A. J., Deffenbacher, K. A., Valentin, D. and Abdi, H.: 1994, Structural aspects of face recognition and the other-race effect, Memory and Cognition 22(2), 208-224.

Pentland, A., Moghaddam, B. and Starner, T.: 1994, View-based and modular eigenspaces for face recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 84-91.

Rao, R. P. N. and Ballard, D. H.: 1995, Natural basis functions and topographic memory for face recognition, Proceedings of IJCAI-95, pp. 10-17.

Rhodes, G., Brennan, S. E. and Carey, S.: 1987, Identification and ratings of caricatures: implications for mental representations of faces, Cognitive Psychology 19, 473-497.

Robertson, G. and Craw, I.: 1994, Testing face recognition systems, Image and Vision Computing 12, 609-614.

Shackleton, M. A. and Welsh, W. J.: 1991, Classification of facial features for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 573-579.

Turk, M. and Pentland, A.: 1991, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3, 71-86.

Ullman, S.: 1989, Aligning pictorial descriptions: an approach to object recognition, Cognition 32, 193-254.

Watt, R.: 1994, A computational examination of image segmentation and the initial stages of human vision, Perception 23, 383-398.
