A Comparison of the Statistical Properties of IQA Databases Relative to a Set of Newly Captured High-Definition Images

Javier Silvestre-Blanes¹, Ian van der Linde², and Rubén Pérez-Llorens¹

¹ Instituto Tecnológico de Informática (ITI), Universitat Politècnica de València (UPV), Ferrandiz y Carbonell s/n, 03801 Alcoy, Spain
{jsilves,ruperez}@disca.upv.es
² Vision & Eye Research Unit (VERU), Postgraduate Medical Institute, Anglia Ruskin University, East Road, Cambridge CB1 1PT, United Kingdom
[email protected]

Abstract. A broad range of image processing applications require image databases during development and testing. Whilst some image databases have been assembled with specific applications in mind, others are intended for more general use, with image content that is purposefully not application-specific. General-purpose image databases are in frequent use in the development of new compression algorithms, including in the evaluation of the efficacy of lossy compression techniques via statistical and human (perceptual) image quality assessment methods. The question of how the images featuring in standard image databases are selected is important, but the selection is rarely quantitatively justified. In this article, we describe the compilation of a new image database of high-definition color images. We present statistical analyses both of the images that feature in the most widely used extant databases and of the new database that we have compiled, in order to evaluate how broad a range of the statistics measured each database spans.

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part IV, LNCS 7575, pp. 800–813, 2012. © Springer-Verlag Berlin Heidelberg 2012

1 Introduction

The development of new image processing algorithms often requires image databases for testing and validation. Often, algorithms under development are quite specific, such as those for face recognition and stereo correlation, and correspondingly specific databases for these and other narrowly defined problems exist. However, a number of general-purpose image databases exist, wherein the content of the specific images selected for inclusion is not tailored to satisfy a particular final application, but aims to be useful in a broad range of applications, such as in the development of new compression algorithms, image quality assessment (IQA), and in the analysis of the statistical properties of natural images.

Several IQA image databases are in widespread use. These include LIVE [1][2][3], IRCCyN/IVC [4], CSIQ [5], TID [6], A57 [7], Toyama [8] and WIQ [9]. These databases comprise a number of lossless images (not always exclusive to each database, see below), along with a set of degraded versions of each image, typically distorted to different degrees with a range of different distortion methods. In studies examining the limits of the human visual system (HVS), one common database was assembled by van Hateren [10], comprising a set of calibrated grayscale images of natural scenes. Since this database contains some man-made structures, in some studies a subset of the available images is used (e.g. DOVES [11]). A further calibrated natural image database for color images, not intended for general use or as a representative subset of the real world, is also available: the McGill Calibrated Colour Image Database [12] (Tabby). Calibration ensures that observed luminance and chrominance are veridically represented when images are digitized, but is unnecessary for most applications.

In Table 1, the size and constitution of a number of common image databases are provided. All incorporate images at a relatively low spatial resolution. Historically, one reason for this is that several core applications (such as IQA, and the study of HVS properties) require that images be presented to observers on a monitor without resampling (which introduces imperfections), i.e., that they be shown at their native resolution on a display device that has a corresponding spatial resolution, thereby imposing a limit on the maximum useful spatial resolution of images featuring in the database. When image resampling is not a concern, other resolutions may be used, as is the case with databases used in the development of new image compression algorithms, such as [13], in which images up to 7116 × 5412 at 16 bits per plane are provided. In almost all common databases, images from the Kodak Lossless True Color Image Suite [14] are used, which possess relatively low spatial resolutions (768 × 512 or 512 × 768).
Clearly, both the number and content of the images provided in each database will influence the results of specific studies to some degree. In IQA studies, the measured performance of quality metrics can fluctuate significantly depending on the database used for testing. In [15], the authors propose that statistical metrics (such as PSNR and SSIM) work better in databases incorporating images at a wide range of quality settings, since findings will be less informative where very high quality images are used, in which distortions may be barely perceptible as a consequence of the limited acuity of the HVS. In addition to image resolution and the severity of the distortion introduced, it is also likely that the content of each of the images featuring in standard image databases will affect performance, potentially limiting the generalizability of results.

A common characteristic of all databases is the apparently arbitrary (or at least scantily documented) protocol for the selection of images. For CSIQ, described in [5], nothing is said about the selection of the 30 original images, except that they are divided into five categories: animals, landscape, people, plants and urban. The same can be said about the 10 images selected in the IVC database [17], although many of these are standard images in widespread use by the image processing community. The images in LIVE [3] were selected to ensure diverse image content, and originate from the Kodak Lossless True Color Image Suite, the Internet and CD-ROMs. Specifically, images include pictures of faces, people, animals, close-up shots, wide-angle shots, natural scenes, man-made objects, images with distinct foreground/background configurations, and also images without any specific object of interest. Almost all images featuring in the Toyama database originate from the Kodak Suite, and all but three also feature in the LIVE database. Reference images used in TID2008 are obtained by cropping images from the Kodak Suite. Once again, the selection procedure is largely undocumented. Furthermore, the image cropping performed to reduce image size will alter global image statistics, reducing the scope for comparing results with those of LIVE and Toyama. The WIQ database is restricted to special distortion cases [18], such as those produced by packet loss in wireless communication, and comprises a number of images well known to the image processing community, rather like IVC, although in grayscale form. In Fig. 1, eight common images featuring in the LIVE, Toyama and TID2008 (which uses cropped versions) databases are shown, along with the names used in each database; the ubiquitous nature of these images means that they have borne a significant influence on the IQA field.

Our objective in this study is to compile a new database, denoted GID (General Image Database [16]), in which the selection of images is justified through the objective analysis of low-level scene statistics (rather than by hand-selecting images that appear to possess a range of desirable properties), and in which images may be further categorized by their semantic content. By labeling test images according to a range of statistical metrics, the image processing community may test algorithms under development by selecting images with specific statistical properties, enabling them to triangulate the efficacy of algorithms across a range of input conditions, look for correlations between input statistics and performance, and so on.

Fig. 1. Common images from LIVE, Toyama & TID2008 Databases


Table 1. Properties. Name: LIVE, Toyama, CSIQ, IVC, A57, TID, WIQ. Resolution: [...]

[...] (r > 0.5) between many of the image statistics calculated, with many other medium-sized correlations (r = 0.3–0.5). In particular, SD and CRMS appear to be collinear, so we take CRMS as an indicator of contrast. Likewise, since E and U are very closely correlated, we take E as an indicator of image complexity. In Fig. 3, the dispersion of each image statistic defined is shown for each image plane. Normalized ranking of each database (0 for the lowest and 1 for the highest; see Table 3) reveals that, in the V plane, the LIVE and van Hateren (vHt) databases have the greatest variability across the statistics measured, and DOVES (a subset of the van Hateren database) the lowest. The same analysis in the H plane reveals that the CSIQ database has the greatest variability across the statistics measured, whereas LIVE has the lowest. At this point, the choice of IQA database with the highest variability is between GID and CSIQ; however, further image properties remain to be examined.

Sometimes, log intensity is considered, so histograms of ln(I(i, j)) − average(ln(I)) [24][25] were calculated. In Fig. 4, histograms of some representative databases for the V plane are shown. In the database with the greatest number of images, van Hateren, positive skew is appreciable (ς = 2.7). This is attributed by some authors [24] to the presence of (high intensity) daytime sky in many images. Similar skewness values are obtained for IQA databases. GID yields the lowest skewness value (ς = 2.29), whereas LIVE, CSIQ and Tabby have ς = 2.46, ς = 2.50, and ς = 2.49, respectively. In these cases, the predominance of high intensity values over low intensity values is due not only to sky, but also to many other parts of the image content, such as clothes, stones, or sails. The intensity distribution appears uniform for all databases, except for CSIQ, which has irregularities in the tails of its histogram.
This could be due to the inclusion of images with many pixels whose intensities are concentrated at the extremes, as in snow leaves, roping (note that green pixels have high intensity in the V plane), sunsetcolor, or family. This kind of image should be included in a database for IQA, since despite being statistically unusual, it is entirely natural. However, where the number of images in a set is at a premium, such images should be relatively sparse. In GID, statistically unusual images exist, producing a higher dispersion in the distributions of core image statistics, but at the same time, due to the small proportion of such images overall, our histograms do not have marked irregularities.

Table 3. Image databases ranked for each image statistic in V and in H plane, and ranked overall

V plane
Name      CRMS   E      M      SK     points
vHt¹      0.65   0.76   0.35   0.31   2.07
LIVE      1      0.68   0      1      2.68
Tabby¹    0.64   0.55   0.70   0.15   2.04
GID       0.43   0.47   0.65   0.22   1.77
CSIQ      0.36   0.31   1      0      1.67
DOVES¹    0      0      0.31   0.02   0.33

H plane
Name      CRMS   E      M      SK     points
CSIQ      1.00   0.00   0.58   1      2.58
Tabby¹    0.74   0.54   1      0      2.28
GID       0.00   1      0.80   0.16   1.96
LIVE      0.20   0.25   0      0.03   0.48

¹ Not an IQA database.

Fig. 4. Histograms of ln(I(i, j)) − average(ln(I)) for V and H planes

Histograms of some representative databases for pixels in the H plane are shown on the right of Fig. 4. We observe higher skewness for Tabby in this plane (ς = 4.91), which could be explained by the concentration of some colors in the images, since this database is not intended as a representation of the real world and is not specifically intended for use in the development of image processing algorithms. LIVE and CSIQ have ς = 1.62 and ς = 1.66, respectively. The shape of the histograms shows irregularities in both cases, which could be due to their low number of images and, in the case of LIVE, its wide range. GID has ς = 2.3 and a uniformly wide shape, so we can conclude that, from the point of view of color information, GID has a rich, uniform distribution relative to other IQA databases.

Concerning other statistics, gradients are the simplest way to analyze the relationship between pairs of pixels. The forward difference gradient at a pixel (i, j) in the plane I can be calculated as:


\[ D_x(i,j) = \ln I(i+1,j) - \ln I(i,j); \quad D_y(i,j) = \ln I(i,j+1) - \ln I(i,j) \tag{7} \]

\[ D(i,j) = \sqrt{D_x(i,j)^2 + D_y(i,j)^2} \tag{8} \]

It is accepted that the gradient histogram has a very sharp peak at zero, and falls off quickly [26]. This distribution can be modeled as $e^{-x^{\alpha}}$ with $\alpha < 1$ [27]. The reason for this shape is connected to the general mixture of large smooth surfaces with few high contrast edges. Across the databases available, α tends to be higher where a greater proportion of natural images is used, since the edges are then similarly distributed across all images and directions. Thus, we obtain α = 0.853 for van Hateren, α = 0.82 for CSIQ, α = 0.79 for LIVE and Tabby, and α = 0.76 for GID. In Fig. 5, the log(histogram) is shown in order to reveal differences in the tails. Note the asymmetry exhibited in the van Hateren database, which may be due to the many sky portions in its images, although it could also arise from other image properties. The CSIQ database has symmetrical tails, indicating that, on average, edges run in all directions and are generally less noticeable. LIVE has concave tails on both sides, which could be due to the fact that edges are a strong component in its images. GID and Tabby, like van Hateren, have a concave tail only on the left, indicating that the gradients of these databases may be a better representation of the real world. The analysis of gradients in the H plane does not reveal differing properties among the color image databases.

Analysis of the Fourier power spectrum is also commonly used to obtain image statistics. These analyses show that low frequencies contain the most power, which decreases as a function of frequency. Analyzing the amplitude as a function of frequency (P) on a log-log scale over a sufficient number of images, the result can be modeled as $P = 1/f^{\beta}$, where $\beta$ is the spectral slope. Some works obtain β for different image ensembles (man-made, vegetation, etc.), and it can be assumed that the average spectral slope varies from 1.8 to 2.4, with most values clustering around 2.0 (a brief review of this and the related references can be found in [26]).

Other studies analyze the shape of the power spectrum signature. Fig. 6 shows the V plane power spectrum signatures enclosing 50% (red), 60% (blue), 75% (green) and 90% (yellow) of the energy over the power spectrum of the databases analyzed. It can be seen that the signatures are coarse when the number of images is low, whereas they are well defined when the number of images is high. Also, the red signature is small when the number of images is high, indicating that low frequencies are the main component of the images, so that the particular properties of individual images in the dataset do not change this. In [28] it is shown how the kind of image produces a characteristic shape for its signature. Thus, it can be seen that van Hateren and DOVES have the shape of natural objects, though the lower number of images in DOVES makes these shapes wider. Tabby shows a mixture of the man-made and natural shapes, which is expected, since it contains a mixture of different types of images. GID shows similar behavior, although weighted slightly towards the natural shape.
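To make the power-law model P = 1/f^β concrete, the following minimal Python sketch estimates β with an ordinary least-squares line fit of log P against log f. It assumes that spectrum samples (f, P) have already been obtained, for example by radial averaging; the function name fit_spectral_slope and the synthetic data are illustrative assumptions and do not reproduce the procedure used to produce the results reported here.

```python
import math


def fit_spectral_slope(frequencies, power):
    """Fit log P = c - beta * log f by least squares; return (beta, c).

    Samples with non-positive frequency or power are skipped, since
    their logarithm is undefined.
    """
    pairs = [(math.log(f), math.log(p))
             for f, p in zip(frequencies, power) if f > 0 and p > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive (f, P) samples")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0.0:
        raise ValueError("all frequencies are identical")
    slope = sxy / sxx
    # The model has a negative slope in log-log space, so beta = -slope.
    return -slope, mean_y - slope * mean_x


# Synthetic spectrum that follows P = 1 / f^2 exactly (illustrative only).
freqs = [float(k) for k in range(1, 65)]
synthetic_power = [1.0 / f ** 2.0 for f in freqs]
beta, intercept = fit_spectral_slope(freqs, synthetic_power)
print(f"estimated beta = {beta:.3f}")
```

On real spectra the fit is usually restricted to an intermediate frequency band, because the lowest and highest frequencies tend to deviate from the power law; the choice of band is left to the reader.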


Fig. 5. Log-histograms of D for V plane
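The gradient statistics of Eqs. (7) and (8) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the plane is a list of rows with plane[i][j] = I(i, j), all intensities are strictly positive (the optional eps offset is our own guard against log(0)), and D is taken as the Euclidean norm of (Dx, Dy), which is our reading of Eq. (8). The helper names log_gradients and log_histogram are hypothetical.

```python
import math


def log_gradients(plane, eps=0.0):
    """Forward-difference gradients of the log-intensity plane.

    Returns (dx, dy, mag) as lists of rows, each one row and one
    column smaller than the input plane.
    """
    log_i = [[math.log(value + eps) for value in row] for row in plane]
    rows, cols = len(log_i), len(log_i[0])
    dx, dy, mag = [], [], []
    for i in range(rows - 1):
        # Dx(i, j) = ln I(i+1, j) - ln I(i, j)
        dx.append([log_i[i + 1][j] - log_i[i][j] for j in range(cols - 1)])
        # Dy(i, j) = ln I(i, j+1) - ln I(i, j)
        dy.append([log_i[i][j + 1] - log_i[i][j] for j in range(cols - 1)])
        # D(i, j) = sqrt(Dx^2 + Dy^2)
        mag.append([math.hypot(gx, gy) for gx, gy in zip(dx[-1], dy[-1])])
    return dx, dy, mag


def log_histogram(values, bins, low, high):
    """Log of the bin counts over [low, high); empty bins map to -inf."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in values:
        if low <= value < high:
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return [math.log(c) if c > 0 else float("-inf") for c in counts]


# Toy 2 x 2 plane chosen so that the log intensities are 0, 1, 2 and 0.
toy_plane = [[1.0, math.e], [math.e ** 2, 1.0]]
dx, dy, mag = log_gradients(toy_plane)
flat_mag = [g for row in mag for g in row]
log_counts = log_histogram(flat_mag, bins=4, low=0.0, high=4.0)
```

Pooling the flattened gradient values of every image in a database before calling log_histogram would give one curve per database, in the spirit of Fig. 5.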

Higher-order statistics, such as those obtained with wavelets or Gabor filters, are only valid if the image exhibits stationary statistics. These statistics cannot be used to select individual images from a large set, but the final results have to be coherent with the results shown, and there must not be significant differences between the newly defined set and the figures presented.

3 Image Selection

The image selection process aims to find a global representation of the real world, including natural scenes and other image types (see Fig. 7). Furthermore, multidimensional classification enables studies to focus upon different types of images. The number of images N is limited by the main purpose of these images, that is, the development and validation of IQA metrics. It is necessary to take into account that each original image has to be distorted using n distortion types. For each distortion, m different levels are applied, such that each one of the Nmn images is evaluated by each of O observers. Finally, NmnO observations are compiled for analysis. For the n distortion types, artifacts that reflect common coding and transmission systems are included, sometimes via simulation. The number of distortion levels, m, may be high, but it is typically considered unnecessary to include a large number of levels, since subjective evaluation is often limited to 5 rating categories (imperceptible, perceptible, slightly annoying, annoying, very annoying) [29]. The number of observers, O, should be large enough to be statistically representative. Thus, N should be chosen in order to achieve a sufficiently representative set of real world images, and to span a range of image statistics, but at the same time to ensure that subjective experiments are feasible (both in terms of time and cost).
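The growth of the experimental workload with N can be illustrated with a short calculation. The sketch below simply evaluates Nmn (distorted images) and NmnO (observations); the values n = 5, m = 5 and O = 20 are hypothetical and are not taken from this study, and the function name rating_workload is ours.

```python
def rating_workload(n_images, n_distortions, n_levels, n_observers):
    """Return (distorted_images, observations) = (N*n*m, N*n*m*O)."""
    for name, value in (("n_images", n_images),
                        ("n_distortions", n_distortions),
                        ("n_levels", n_levels),
                        ("n_observers", n_observers)):
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    distorted = n_images * n_distortions * n_levels
    return distorted, distorted * n_observers


# Hypothetical experiment design: 5 distortion types, 5 levels, 20 observers.
for n_images in (500, 200, 50, 12):
    distorted, observations = rating_workload(
        n_images, n_distortions=5, n_levels=5, n_observers=20)
    print(f"N={n_images:3d}: {distorted:6d} distorted images, "
          f"{observations:7d} observations")
```

Under these assumed values, reducing N from 500 to 50 cuts the number of observations by a factor of ten, which is the practical motivation for seeking a small but statistically representative subset.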

Fig. 6. Mean power spectrum signature for each image database: a) van Hateren, b) DOVES, c) GID, d) CSIQ, e) LIVE, f) Tabby

Fig. 7. Example GID images

The set of 500 images provided in GID may be reduced without loss of representativeness. We use a random selection process to reduce the number of images from the original collection without loss of global statistical characteristics. Of the common subsampling methods available (simple random sampling, stratified random sampling, and cluster sampling), simple random sampling was used, implemented with the reservoir sampling algorithm [30], since we consider the images as a single set (without exclusive subsets). First, we obtain a subset of 200 images, denoted GID200. From this subset, we repeat the random process to select 50 images, giving the GID50 subset. Iteratively, we try to reduce the value of N further, in this case to 12: from GID50 we repeat the random process four different times to select GID12^A, GID12^B, GID12^C and GID12^D.

Fig. 8 shows the first-order statistics of the V plane for GID and for all the subsets mentioned. The dispersion and complexity represented by the parameters described in the previous section are maintained in GID200 and show a slight reduction in GID50, especially in CRMS; the reduction is significant in the other subsets, especially in entropy and skewness. Similar behavior is obtained in the H plane, as can be seen in Fig. 8. The skewness values of the log intensity histograms of ln(I(i, j)) − average(ln(I)) for GID200, GID50, GID12^A, GID12^B, GID12^C and GID12^D are ς = 2.29, ς = 2.28, ς = 2.11, and ς = 2.44, ς = 2.42, ς = 2.39, and ς = 2.33. No significant differences were found as the number of images in the subsets was reduced. The gradients in the subsets, even in the smaller subsets, gave similar results to the full GID set, as can be seen in Fig. 9. The power spectrum signatures of the subsets are shown in Fig. 10. The main properties are maintained for GID200 and GID50, while quite different shapes are obtained when the number of images is reduced to 12, which changes the average image type.

Fig. 8. Ranked image statistic dispersions (SD) for each image subset. Error bars are 1.96 SEM.

Fig. 9. Log-histograms of D for the V plane for subsets
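The nested subsampling described above (500 to 200 to 50, then four independent draws of 12) can be sketched with a simple reservoir sampler. The code below implements Algorithm R, the basic reservoir method discussed in [30]; the text does not state which variant from [30] was actually used, so this choice is an assumption. The file names and the seed are hypothetical placeholders.

```python
import random


def reservoir_sample(items, k, rng):
    """Algorithm R: draw k items uniformly at random in a single pass."""
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number index + 1 replaces a random slot
            # with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("fewer than k items available")
    return reservoir


rng = random.Random(2012)  # arbitrary seed, for repeatability only
gid = [f"gid_{i:03d}.png" for i in range(500)]  # hypothetical file names
gid200 = reservoir_sample(gid, 200, rng)
gid50 = reservoir_sample(gid200, 50, rng)
# Four independent draws of 12 images from the same 50-image subset.
gid12 = {label: reservoir_sample(gid50, 12, rng) for label in "ABCD"}
```

Because each subset is drawn from the previous one, every 12-image subset is contained in GID50, which is in turn contained in GID200, matching the nesting described in the text.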

Fig. 10. Spectral signatures of GID subsets: a) GID200, b) GID50, c) GID12^A, d) GID12^B, e) GID12^C, f) GID12^D

4 Conclusions and Future Work

We have analyzed several image databases that are used for different purposes, and have seen that they have very different properties. The images contained in these databases are used for the development of new image processing algorithms, including lossy compression, image quality assessment, etc. The properties of the set can have an influence on the algorithms developed. In this paper we have compiled a new set of images, with two important differences with respect to previous databases: the use of high-definition resolution, and the use of a large number of images. We have analyzed this set and compared its statistics with those of other databases. We then obtained some subsets of the original set, and found that their representativeness is maintained when N is reduced from 500 to 200 and even to 50, rendering subjective image quality rating feasible. The reduction of N to 12 had an impact on some statistics. Our next task is to complete subjective image quality evaluation for these images (specifically, GID50). With this analysis, we will obtain the results for the smaller subsets (GID12^A, GID12^B, GID12^C, GID12^D), so that we can determine the influence of the number of images, and of the image statistics, on the evaluation of image quality.

Acknowledgement. This work is supported by the MCYT of Spain under the project TIN2010-21378-C02-02 and by Universitat Politècnica de València under PAID-00-11.

References

1. Sheikh, H.R., Wang, Z., Cormack, L.K., Bovik, A.C.: LIVE image quality assessment database release 2, http://live.ece.utexas.edu/research/quality
2. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
3. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (2006)
4. Le Callet, P., Autrusseau, F.: Image quality evaluation database, http://www.irccyn.ec-nantes.fr/ivcdb
5. Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging 19(1), 011006 (2010)
6. Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: TID2008 - A Database for Evaluation of Full-Reference Visual Quality Assessment Metrics. Advances of Modern Radioelectronics, 30–45 (2009)
7. Chandler, D.M., Hemami, S.S.: VSNR: A Wavelet-Based Visual Signal-to-Noise Ratio for Natural Images. IEEE Trans. Image Process. 16(9), 2284–2298 (2007)
8. Horita, Y., Kawayoke, Y., Parvez Sazzad, Z.M.: Image quality evaluation database, ftp://[email protected]
9. Engelke, U., Zepernick, H.-J., Kusuma, M.: Wireless Imaging Quality Database, http://www.bth.se/tek/rcg.nsf/pages/wiq-db
10. van Hateren, J., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. B-Biol. Sci. 265(1394), 359–366 (1998)
11. van der Linde, I., Rajashekar, U., Bovik, A.C., Cormack, L.K.: DOVES: A database of visual eye movements. Spatial Vis. 22(2), 161–177 (2009)
12. Olmos, A., Kingdom, F.A.A.: A biologically inspired algorithm for the recovery of shading and reflectance images. Perception 33, 1463–1473 (2004)
13. Rawzor, www.imagecompression.info
14. Kodak, http://r0k.us/graphics/kodak/
15. Tourancheau, S., Autrusseau, F., Parvez Sazzad, Z.M., Horita, Y.: Impact of subjective dataset on the performance of image quality metrics. In: IEEE Int. Conf. in Image Processing (ICIP), San Diego, California, USA (2008)
16. Silvestre-Blanes, J.: http://muro1.alc.upv.es/eccv12/gid.html
17. Ninassi, A., Le Callet, P., Autrusseau, F.: Pseudo No Reference image quality metric using perceptual data hiding. In: SPIE Electronic Imaging, Human Vision and Electronic Imaging Conference XI, HVEI 2006, San Jose, USA (2006)
18. Engelke, U., Kusuma, M., Zepernick, H.J., Caldera, M.: Reduced-Reference Metric Design for Objective Perceptual Quality Assessment in Wireless Imaging. Signal Process.-Image Commun. 24(7), 525–547 (2009)
19. Cubero, S., Aleixos, N., Moltó, E., Gómez-Sanchis, J., Blasco, J.: Advances in Machine Vision Applications for Automatic Inspection and Quality Evaluation of Fruits and Vegetables. Food Bioprocess Technol., 487–504 (2011)
20. Blasco, J., Aleixos, N., Moltó, E., Gómez-Sanchis, J.: Citrus sorting by identification of the most common defects using multispectral computer vision. J. Food Eng., 384–393 (2007)
21. Rajashekar, U., van der Linde, I., Bovik, A.C., Cormack, L.K.: GAFFE: A gaze-attentive fixation finding engine. IEEE Trans. Image Process. 17(4), 564–573 (2008)
22. Párraga, C.A., Brelstaff, G., Troscianko, T.: Color and luminance information in natural scenes. J. Opt. Soc. Am. A-Opt. Image Sci. Vis. 15, 563–569 (1998)
23. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Inc. (1988)
24. Huang, J., Mumford, D.: Statistics of Natural Images and Models. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1999), Ft. Collins, CO, USA, vol. 1, pp. 541–547 (1999)
25. Brady, M., Field, J.D.: Local contrast in natural images: normalization and coding efficiency. Perception 29, 1041–1055 (2000)
26. Pouli, T., Cunningham, D.W., Reinhard, E.: Image Statistics and their Applications in Computer Graphics. In: Eurographics, Norrköping, Sweden, pp. 83–112 (2010)
27. Simoncelli, E.: Statistical modeling of photographic images. In: Bovik, A. (ed.) Handbook of Image and Video Processing, pp. 431–443. Elsevier Academic Press (2005)
28. Torralba, A., Oliva, A.: Statistics of natural image categories. Network: Comput. Neural Syst. 14, 391–412 (2003)
29. ITU-R BT.500-7 Methodology for the Subjective Assessment of the Quality of Television Pictures
30. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)