EI2000, SPIE paper no. 3959-44

Applying perceptually-based metrics to textural image retrieval methods

Janet S. Payne (a), Lee Hepplewhite (b), T. J. Stonham (b)

(a) Dept of Computing, Buckinghamshire Chilterns University College, High Wycombe, HP11 2JZ, UK
(b) Dept of Electrical Engineering, Brunel University, Uxbridge, UB8 3PH, UK

ABSTRACT
Texture plays an important part in many Content-Based Image Retrieval systems. This paper describes the results of a human study in which 30 volunteers were asked to classify images from the Brodatz textures album. We use these results to derive a subset which shows good agreement among the different individuals. The results for this subset were used to evaluate the retrieval performance of a range of statistical, Fourier-based, and spatial/spatial-filtering methods. However, no one computational method works well for all textures, unlike the human visual system. We show how each of the ten methods correlates with the rankings from the human study; the results typically match for only about 20%-25% of the images. Combining two techniques can improve the retrieval performance, as judged by human users. We also identify a further subset of the Brodatz images where no computational method correlates significantly with the composite human ranking. Of the 85 images selected by the human study, only 64 have any significant correlation with one or more of the computational methods in this paper. The excluded images, where human users agree with each other but none of the methods we evaluated did, provide a further challenge to texture-based image retrieval techniques.

Keywords: Texture, content-based image retrieval, perceptual similarity, retrieval metrics

1. INTRODUCTION
Texture plays an important role in human vision, and is of great significance for Content-Based Image Retrieval (CBIR)6,8. Successful CBIR systems have been implemented using colour, texture, shape and spatial relationships, individually or in combination. Colour similarity is relatively straightforward to implement, and has been used successfully, for example in QBIC7 and Virage8; however, colour alone cannot distinguish between tigers and cheetahs! There have been a number of studies relating texture and human perception, from the early work of Julesz, and of Beck, onwards. Many researchers have made use of the Brodatz photographic album of textures3, and although other groups have proposed additional sets of images, the Brodatz textures remain the most widely used. For CBIR, it is of interest to compare the images retrieved by a range of computational methods with those chosen by individual human viewers; if the aim is to retrieve “images like this one”, then the human user of the system should be the final arbiter of which other images are most like “this one”. Since the Brodatz textures are widely used, and form a relatively small set (of 112 images), it is appropriate to investigate how human viewers would classify similar images among them. The

majority of the textures are homogeneous, and generally of high spatial frequency; they do not therefore represent all possible textures of interest. Additionally, a few are of recognisable images, and the human tendency to label, whether correctly or not, quickly comes into effect. There is no equivalent “labelling” by the computational methods! However, the Brodatz textures still provide a useful basis for comparison, within these limitations.

2. TEXTURAL FEATURES AND HUMAN PERCEPTION
Several studies have focused on relating computational measures of texture to human perception: for example, Tamura, Mori and Yamawaki23, Amadasun and King1, and Liu and Picard15. Rao and various colleagues have been developing a “Texture Naming System”, based on unsupervised human classification of a set of 56 Brodatz images and matching with texture names2. There is some agreement among most researchers on the main perceptually salient attributes of texture. The names chosen are “somewhat subjective”, but include repetitiveness or periodicity, coarseness, directionality, degree of complexity, and contrast. The human visual system works well for many different textures, unlike any of the computational methods, as most authors comment!

The Brodatz album3 is a well-established standard, and all of these studies used some or all of its images. The original Brodatz album is a photographic one, with high-quality reproduction of the individual textures, each one occupying most of a 7¾ by 10½ inch page (with a picture area of 19.5 by 24 cm). However, the various studies have worked with digitised images, using varying numbers of pixels and grey levels: most commonly 384 by 384 pixels with 256 grey levels1,15, but 256 by 256 pixels with 64 grey levels23, and 512 by 512 pixels with 8-bit grey levels, have also been used. In all cases, square images are used, usually by omitting a strip at the bottom. There are also variations in lighting and method of digitisation (e.g. photography, or scanning, with various resolutions), which make it harder to compare the different studies. The loss of quality in going from the original page to a computer screen or a 300 or 600 dpi laser-printed page is very noticeable. For comparability, we digitised all 112 Brodatz images, using a scanner, to 384 by 384 pixels, 8-bit grey scale, and used this both for the human study and for each of the image retrieval methods described below.
Each image was divided into nine non-overlapping tiles, of 128 by 128 pixels.
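The tiling step above is straightforward; the following is a minimal sketch (the function name `tile_image` is ours, not from the paper) of splitting one 384 by 384 scan into its nine non-overlapping 128 by 128 tiles:

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 128) -> list:
    """Split a square image into non-overlapping square tiles, row-major order."""
    h, w = image.shape
    assert h % tile_size == 0 and w % tile_size == 0
    return [
        image[r:r + tile_size, c:c + tile_size]
        for r in range(0, h, tile_size)
        for c in range(0, w, tile_size)
    ]

# Stand-in for one digitised 384x384, 8-bit Brodatz image.
image = np.arange(384 * 384, dtype=np.uint32).reshape(384, 384)
tiles = tile_image(image)
print(len(tiles), tiles[0].shape)  # nine tiles of 128x128
```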

Figure 1 Display screen, for human study, awaiting user input

In our human study20,21, volunteers were asked to select up to four images, in order, that they considered “most like” the central, target image. We used one tile only from each of the Brodatz textures, the top left-hand corner, as we could not expect individuals to work through 1008 images. Figure 1 shows one of the screens, awaiting user input. So far, 30 people have taken part, and their results have been combined to form a composite ranking of similar images; correlation coefficients have been calculated between each individual and the composite result.
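The paper does not specify how the 30 ordered selections were combined into the composite ranking; one simple possibility is a Borda-style point score, sketched below purely as an illustration (this is an assumption, not necessarily the aggregation actually used):

```python
from collections import defaultdict

def composite_ranking(selections: list) -> list:
    """Borda-style aggregation (an assumption): a volunteer's first choice
    scores 4 points, second 3, third 2, fourth 1; highest total wins."""
    scores = defaultdict(int)
    for choices in selections:                    # one ordered list per volunteer
        for rank, image_id in enumerate(choices[:4]):
            scores[image_id] += 4 - rank
    # Ties broken alphabetically, purely for determinism.
    return sorted(scores, key=lambda k: (-scores[k], k))

votes = [["D31", "D81", "D83"], ["D81", "D31"], ["D31", "D83", "D81", "D2"]]
print(composite_ranking(votes))  # ['D31', 'D81', 'D83', 'D2']
```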

3. MEASURES OF RETRIEVAL
In document information retrieval, performance is traditionally measured in terms of recall and precision22. Recall measures the ability of the system to retrieve useful documents, while precision measures its ability to reject useless documents. The two measurements are inter-related, and require some method of defining relevance. If the Brodatz textures are divided into tiles, one approach is to consider all the remaining tiles of each image to be relevant17. Most, but not all, of the textures are homogeneous, so this is suitable for the majority; it may be misleading for the non-homogeneous images. Another approach is for the experimenter to define relevant images, but this is somewhat subjective, and difficult with large datasets. Our human study, described above in section 2, is used to derive a set of relevant images for each of the Brodatz textures, by combining the individual responses and excluding images with poor agreement.

The traditional document information retrieval measures can also be applied to image retrieval6, and have been adapted, e.g. by QBIC7, to provide a normalised recall. They define the AVRR, the average rank of all relevant, retrieved images, which can be compared with the IAVRR, the ideal AVRR if all T relevant items occur in the first T retrievals. A lower score, closer to the IAVRR, therefore represents a retrieval with fewer irrelevant items. If the order of retrieval matters to the user, which it may well, there are a variety of statistical techniques which provide measures of association given two ordered series, such as Spearman’s rho or Kendall’s tau coefficient of rank correlation5. Spearman’s rank-order correlation coefficient, rho, is perhaps the best known of all the statistics based on ranks, and was the earliest to be developed. It was used by Tamura, Mori and Yamawaki23.
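The AVRR/IAVRR comparison can be sketched as follows (function names are ours): AVRR averages the ranks at which the relevant images actually appear, while IAVRR is the mean of ranks 1..T, i.e. (T + 1) / 2, the value AVRR would take if all T relevant items were retrieved first.

```python
def avrr(retrieved: list, relevant: set) -> float:
    """Average rank of the relevant images within the retrieval order."""
    ranks = [i + 1 for i, img in enumerate(retrieved) if img in relevant]
    return sum(ranks) / len(ranks)

def iavrr(num_relevant: int) -> float:
    """Ideal AVRR: mean of ranks 1..T, if all T relevant items come first."""
    return (num_relevant + 1) / 2

retrieved = ["D31", "D4", "D81", "D83", "D9"]   # illustrative retrieval order
relevant = {"D31", "D81", "D83"}
print(avrr(retrieved, relevant), iavrr(len(relevant)))  # about 2.67 vs the ideal 2.0
```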
Kendall’s tau is perhaps clearer to understand and apply13, and although the calculated numerical values of the coefficients will be different, both will produce nearly identical results in most cases, and both can be used to test the significance of the association between two sets of data5. Kendall’s tau can be viewed as a coefficient of disorder. For example, consider the following two rankings, where both have selected the same four images, but have placed them in a different order:

1 2 3 4
2 1 4 3

That is, the first person’s first choice is ranked second by the other person, and so on. Tau is calculated as

    tau = (no. of pairs in order - no. of pairs out of order) / (total no. of possible pairs)    (1)

For this example, 2 in the bottom row is followed by 1, 4, and 3. 2-1 is out of order, scoring -1, and 2-4, 2-3 are in order, scoring +1 each. Similarly, 1 is followed by 4 and 3. Both are in order, scoring +1 each. Finally, 4 is followed by 3, scoring -1. The number of in-order pairs is four, and out-of-order pairs is two, therefore the total is +2, divided by the maximum number of in-order pairs, N(N - 1) / 2, which here is 6, since N = 4. The value of tau is therefore 2/6, or 0.3333. This gives a measure of the “disarray”, or difference in ranking, between the two. It ranges from -1, which represents complete disagreement (a choice of 4 3 2 1 in this example), through 0 (no correlation), to +1, complete agreement (1 2 3 4 in this example). We have used Kendall’s tau to correlate the results of the human study and a number of computational methods 21 . For the number of retrievals used here, the one-tailed 1% significance level is 0.8, and the 5% significance level is 0.6.
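The pair-counting procedure above translates directly into code; the following sketch reproduces the worked example:

```python
from itertools import combinations

def kendall_tau(rank_a: list, rank_b: list) -> float:
    """Kendall's tau: (in-order pairs - out-of-order pairs) / N(N-1)/2,
    for two orderings of the same N items."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):   # x precedes y in ranking A
        if pos_b[x] < pos_b[y]:            # same relative order in ranking B
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# The worked example from the text: rankings 1 2 3 4 and 2 1 4 3.
print(kendall_tau([1, 2, 3, 4], [2, 1, 4, 3]))  # 2/6 = 0.333...
```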

4. IMAGES EXCLUDED FROM THE COMPARISONS
The results from the human study described in section 2 were used to calculate the rank correlation of each individual’s selection with respect to the combination of the choices made by all 30 people. The correlation coefficients for each

individual with respect to the composite ranged from a minimum of -0.0370 (i.e., no correlation at all) to a maximum of 1. Mean values, across all images, ranged from 0.4354 to 0.7686, with 25 of the 30 people having a mean tau > 0.6, the 5% significance level. Images where less than half the group had a significant correlation with the overall ranking have been excluded. Figure 2 shows these images:

Figure 2 Images showing poor agreement within the human study: D1, D8, D13, D14, D20, D21, D24, D37, D39, D48, D49, D53, D54, D55, D61, D67, D71, D72, D80, D82, D84, D86, D87, D91, D97, D100 and D112

Images such as D37 (water), D48 (perforated panel) and D87 (fossilised sea fan) are unlike any other images in the set, so it is not surprising that there was little or no agreement among individuals as to what other images were most similar. If individual humans have difficulty assigning a consistent ranking to these, it is not appropriate to use them for evaluation and comparison of textural retrieval methods. The composite ranking for each of the remaining 85 images has been used in the rest of this paper, as the standard for comparison with each computational method.

5. COMPUTATIONAL TECHNIQUES USED IN THE COMPARISON
A range of computational methods was selected for comparison, covering statistical, Fourier, and spatial approaches, as discussed in21. There is one obvious omission: the model-based MRF or SAR methods. These have been omitted from this comparison due to their high computational complexity, as shown for example by the slow execution of a SAR method when compared with the relatively complex Gabor-based method17. The following ten methods were implemented as in the relevant reference; any specific parameters are detailed below.

Statistical
- Haralick: implemented using the matrix itself as the feature vector. The number of grey levels in the texture image was reduced to 16, and a single displacement vector used9.
- GLCM: as above, but with 64 levels of intensity; the commonly cited matrix features of energy, entropy, correlation, homogeneity and inertia were extracted, using four displacement vectors4.
- TUTS: implemented as local binary patterns10, reducing the feature space dimensionality from N = 6561 to N = 256.
- BTCS: with n-tuple size n = 4, interpixel spacing t = 1, and a global threshold level17.
- GLTCS: with n = 4 and t = 1, as for BTCS18.
- SRank: n = 4 and t = 1, as in GLTCS and BTCS, but with a “roughly equal to” band of ±5 levels11.

Fourier-based
- R&W: implemented as in24, with four ring features and four wedge features.
- LSF: Liu’s spectral features16, implementing six computationally efficient and optimal features.

Spatial
- LTE: Laws’ texture energy method14, using nine 3 by 3 masks and a 5 by 5 moving-window standard deviation estimate.
- Gabor: Gabor filter energy with four orientations and up to four scales, depending on the window size. The features extracted are quadrature filter pair mean and standard deviation of energy12.
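As an illustration of the co-occurrence family of methods listed above, the following sketch builds a normalised grey-level co-occurrence matrix for a single displacement vector and extracts four of the commonly cited features (energy, entropy, homogeneity, inertia). Parameter choices and function names here are illustrative, not the exact implementations evaluated in the paper:

```python
import numpy as np

def glcm(image: np.ndarray, levels: int, dr: int, dc: int) -> np.ndarray:
    """Normalised co-occurrence matrix for displacement (dr, dc)."""
    m = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(max(0, -dr), rows - max(0, dr)):
        for c in range(max(0, -dc), cols - max(0, dc)):
            m[image[r, c], image[r + dr, c + dc]] += 1
    return m / m.sum()

def glcm_features(p: np.ndarray) -> dict:
    """Four standard scalar features of a normalised co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    nz = p[p > 0]                                   # avoid log(0) in entropy
    return {
        "energy": float((p ** 2).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "homogeneity": float((p / (1.0 + (i - j) ** 2)).sum()),
        "inertia": float((p * (i - j) ** 2).sum()),
    }

rng = np.random.default_rng(0)
img = rng.integers(0, 16, size=(64, 64))            # 16 grey levels, as for Haralick above
feats = glcm_features(glcm(img, levels=16, dr=0, dc=1))
print(sorted(feats))
```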

5.1 Correlation with Statistical Techniques
Three of the textures correlate significantly with five of the six statistical techniques, shown in the top row of Figure 3, and five more with four of the six, shown in the bottom row. In general, all show fine texture, with some regularity. Some show strong directionality, others not; this is not a significant feature for statistical methods.

Figure 3 Strongly correlated images, using statistical methods: D31, D2 and D83 (top row); D23, D36, D106, D81 and D105 (bottom row)

5.2 Correlation with Fourier-based Techniques
Both Fourier-based methods pick out the following images, shown in Figure 4. D16, D32 and D33 are fine-structured, high frequency, and low contrast, whereas D46, D62, D88 and D102 show regularity and, in most cases, strong directionality. As can be seen, none of these images were matched by a majority of the statistical methods.


Figure 4 Strongly correlated images, using Fourier-based methods: D16, D32, D33, D46, D62, D88 and D102

A further 29 images are selected by one or the other (but not both) of these methods: D3, D4, D6, D10, D12, D19, D22, D23, D26, D30, D31, D36, D44, D57, D64, D65, D70, D74, D81, D83, D89, D90, D95, D98, D103, D104, D105, D106 and D110.

5.3 Correlation with Spatial Techniques
Taking Laws’ Texture Energy and Gabor filters together, as representative of spatial-based methods, the following images are matched by both, as shown in Figure 5:

Figure 5 Strongly correlated images, using spatial methods: D31, D65, D81, D83, D95, D103 and D106

These generally show fine features and regularity, with strong directionality in most cases. D31, pebbles, appears to be an “easy” image, whether using statistical or spatial methods, as well as being one where the human study showed strong agreement. There are a further 26 images where one or the other, but not both, matches: D4, D7, D9, D10, D12, D18, D23, D25, D26, D36, D42, D45, D46, D56, D57, D58, D62, D74, D85, D88, D89, D90, D98, D102, D104 and D105.

6. IMAGES WITH SIGNIFICANT CORRELATION FOR ALL METHODS
No image correlates significantly with 9, or all 10, of the computational techniques used in this study. Figure 6 shows those images which correlate significantly with the selections made by individuals, for 7 or 8 of the ten computational methods:

Figure 6 Strongly correlated images, for almost all methods: D31, D81, D83 and D106

Figure 7 shows those images which correlate significantly with the human rank order, for 6 of the methods:

Figure 7 Strongly correlated images, for most methods: D23, D36, D95, D103 and D105

Finally, Figure 8 shows those which correlate significantly with 5 of the 10 methods.

Figure 8 Strongly correlated images, for half the methods: D4, D32, D46, D62, D88 and D104

Since the ranking given by the human study correlates with the ordering of the majority of the computational methods used, these can be used to evaluate the effectiveness of other techniques. These images show regularity and, in most cases, strong directionality, and fine rather than coarse structure.

7. IMAGES THAT ARE DIFFICULT TO RETRIEVE
There are a number of textures where only one method significantly matches the human ordering, shown in Figures 9 to 13 below. GLCM (the improved version of the original co-occurrence method) performs best, matching five images, shown in Figure 9, that are missed by all the other techniques. Generally, these show very fine structure, and regularity, but only one has strong directionality.

Figure 9 Images matched by GLCM only: D11, D28, D34, D78 and D93

Liu’s SF is the next best, with three images that are missed by all others. As can be seen in Figure 10, these show some regularity, but vary in their other attributes.

Figure 10 Images matched by Liu’s SF only: D6, D64 and D110

Ring & Wedge (Figure 11) picks up one image, and Gabor (Figure 12) two more. These are somewhat difficult images, in that D70 is not homogeneous, but otherwise resembles several of the other textures; and D42 is one of the pictorial images, which people label very quickly, and match to the other pictures of lace.

D70

D18

Fig 11 R&W

D42 Fig 12 Gabor

Finally, Figure 13 shows the remaining co-occurrence methods, which each pick out one or more images that are missed by the others:

Figure 13 Images matched by various co-occurrence techniques: D99 (TUTS); D27, D38 and D70 (BTCS); D73 (SRank)

These textures are somewhat irregular, and lacking strong directionality, with varying contrast.

8. COMBINING TECHNIQUES
Using the same criteria, where individual techniques show a significant correlation with the ranking produced by the composite of individual rankings from the human study outlined in section 2, the number of textures matched by each method is shown in Table 1 for the full set of 112 images. Table 2 shows the corresponding matching retrievals for the subset of 85 textures, excluding those where there was a low level of agreement between individuals.


9. THE UNSELECTED IMAGES
Part of the difficulty in improving performance much above 50% is that, even within the subset of 85 images, there are a number that none of the ten methods select. There are 21 of these, with a correlation coefficient tau < 0.6, shown in Figure 14 below. These provide a challenge to texture-based retrieval techniques.

Figure 14 Textures where no method shows significant correlation: D15, D17, D40, D41, D47, D50, D51, D52, D66, D68, D69, D75, D76, D77, D94, D99, D101, D107, D108, D109 and D111

10. CONCLUSIONS
Working with a subset of the Brodatz images, derived from a human study in which 30 people classified the full set as “most like the query image”, we have shown that no one computational method matches the human classification fully. Typically, any one method matches only about a quarter of the human classification, as shown by a rank correlation coefficient of tau > 0.6, representing the one-tailed 5% significance level. However, different techniques match on different images, and we therefore propose combining two or more techniques to improve the classification, as judged by the human study. Even so, this only improves the performance to about 50%. It would seem that, using human judgements of “similar” images as relevance criteria, texture-based CBIR still has some way to go. It is also noticeable that, even working with a subset where individuals show significant agreement, there remain some 21 images (nearly a quarter of the subset) where none of the computational techniques considered in this paper match the human-derived ranking to any significant extent. It will be interesting to compare additional techniques to see if this can be improved.

REFERENCES
1. M. Amadasun and R. King, “Textural features corresponding to textural properties”, IEEE Trans. SMC, vol 19, no 5, p1264-1274, 1989.
2. N. Bhushan, A. R. Rao, and G. L. Lohse, “The texture lexicon: understanding the categorization of visual texture terms”, Tech Report, IBM, 1994.
3. P. Brodatz, Textures - a photographic album for artists & designers, Dover, New York, 1966.
4. R.W. Conners and C.A. Harlow, “A theoretical comparison of texture algorithms”, IEEE Trans. PAMI, vol 2, no 3, p204-222, 1980.


          Haral.  GLCM  TUTS  BTCS  GLTCS  SRank  R&W  LiuSF  LawsTE  Gabor
no.       13      26    29    24    28     27     29   22     23      26
as %      12      23    26    21    25     24     26   20     21      23

Table 1 number of images with significant correlation, full set

          Haral.  GLCM  TUTS  BTCS  GLTCS  SRank  R&W  LiuSF  LawsTE  Gabor
no.       9       21    25    22    23     21     24   19     19      21
as %      11      25    29    26    27     25     28   22     22      25

Table 2 number of images with significant correlation, subset of 85 textures

In general, there appears to be very little difference in percentage terms, whether we consider the selected subset, as determined by the human study, or the whole set, which includes images where individuals found it hard to agree on the ordering. Excluding the original co-occurrence method, agreement ranges from 20% to 29% only. That is, no one technique corresponds to the perceptual rank ordering in more than about a quarter of the Brodatz textures. However, as has already been noted, the different techniques perform better for different images. Combining techniques produces an improvement in performance, as shown in Table 3, where the retrievals from two different techniques are pooled.

          TUTS/   TUTS/  BTCS/  BTCS/  TUTS/  R&W/   SRank/
          LiuSF   R&W    Gabor  LiuSF  Gabor  LiuSF  LawsTE
no.       41      39     38     36     34     36     31
as %      48      46     45     42     40     42     36

Table 3 Combining two methods

Combining two methods provides a substantial improvement, from a quarter to almost a half of the textures. However, combining three different methods makes much less difference; the major gain in classification performance comes from selecting two complementary methods, such as the ones in Table 3 above.
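The pooling behind the combined figures can be sketched as a simple set union: a texture counts as matched by a method pair if either method’s tau against the composite human ranking reaches the 5% significance level (0.6). The tau values below are illustrative, not the paper’s measured coefficients:

```python
SIGNIFICANCE = 0.6   # one-tailed 5% level for Kendall's tau, as in Section 3

def matched(tau_by_image: dict) -> set:
    """Images whose tau against the composite human ranking is significant."""
    return {img for img, tau in tau_by_image.items() if tau >= SIGNIFICANCE}

def pool(method_a: dict, method_b: dict) -> set:
    """A texture is matched by the pair if either method matches it."""
    return matched(method_a) | matched(method_b)

# Illustrative tau values for two methods over three textures.
tuts = {"D31": 0.8, "D46": 0.3, "D95": 0.7}
liusf = {"D31": 0.5, "D46": 0.9, "D95": 0.2}
print(sorted(pool(tuts, liusf)))  # ['D31', 'D46', 'D95']
```

The union explains why pairing complementary methods helps: each contributes textures the other misses, while overlapping matches are counted once.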

          BTCS/R&W/  TUTS/LiuSF/  BTCS/R&W/  GLTCS/LiuSF/
          LiuSF      LawsTE       LawsTE     LawsTE
no.       47         44           42         40
as %      55         52           49         47

Table 4 Combining three methods

It will be interesting to see if other computational techniques match the ordering on still different images, and thus allow a further improvement in performance. There is also the issue of how to combine the different techniques for improved texture-based image retrieval.

5. W. J. Conover, Practical Nonparametric Statistics, 3rd ed, John Wiley & Sons, New York, 1999.
6. A. Del Bimbo, Visual Information Retrieval, Morgan Kaufmann Publishers, San Francisco, 1999.
7. C. Faloutsos, M. Flickner, D. Petcovic, W. Niblack, W. Equitz and R. Barber, “Efficient and effective querying by image content”, Tech Report, IBM, 1993.
8. A. Gupta and R. Jain, “Visual information retrieval”, Comm. ACM, vol 40, no 5, p70-79, 1997.
9. R.M. Haralick, K. Shanmugam and I. Dinstein, “Textural features for image classification”, IEEE Trans. SMC, vol 3, no 6, p610-621, 1973.
10. D.C. He and L. Wang, “Texture unit, texture spectrum and texture analysis”, IEEE Trans. Geoscience and Remote Sensing, vol 28, no 4, p509-512, 1990.
11. L. Hepplewhite, T.J. Stonham and R.J. Glover, “Automated visual inspection of magnetic disk media”, Proc. 3rd ICECS, vol 2, p732-735, 1996.
12. A.K. Jain and F. Farrokhnia, “Unsupervised texture segmentation using Gabor filters”, Pattern Recognition, vol 24, no 12, p1167-1186, 1991.
13. M. Kendall and J. Dickinson Gibbons, Rank Correlation Methods, 5th ed, Edward Arnold, London, 1990.
14. K.I. Laws, Texture Image Segmentation, PhD thesis, University of Southern California, 1980.
15. F. Liu and R.W. Picard, “Periodicity, directionality, and randomness: Wold features for perceptual pattern recognition”, Proc. 12th ICPR, vol B, p184-189, 1994.
16. S.S. Liu and M.E. Jernigan, “Texture analysis and discrimination in additive noise”, CVGIP, vol 49, p52-67, 1990.
17. B.S. Manjunath and W.Y. Ma, “Texture features for browsing and retrieval of image data”, Tech Report, UCSB, 1995.
18. D. Patel and T.J. Stonham, “Low level image segmentation via texture segmentation”, Proc. SPIE Visual Comms. and Image Processing, vol 1606, p621, 1991.
19. D. Patel and T.J. Stonham, “Unsupervised / supervised texture segmentation and its application to real world data”, Proc. SPIE Visual Comms. and Image Processing, vol 1818, 1992.
20. J.S. Payne, L. Hepplewhite and T.J. Stonham, “Evaluating content-based image retrieval techniques using perceptually based metrics”, Proc. SPIE, vol 3647, p122-133, 1999.
21. J.S. Payne, L. Hepplewhite and T.J. Stonham, “Perceptually based metrics for the evaluation of textural image retrieval methods”, Proc. IEEE ICMCS99, vol II, p793-797, 1999.
22. G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
23. H. Tamura, S. Mori and T. Yamawaki, “Textural features corresponding to visual perception”, IEEE Trans. SMC, vol 6, no 6, p460-473, 1978.
24. J.S. Weszka, C.R. Dyer and A. Rosenfeld, “A comparative study of texture measures for terrain classification”, IEEE Trans. SMC, vol 6, no 4, p269-285, 1976.