Application of a visual model to the design of an ultra-high definition upscaler

Jon M. Speigle, Dean S. Messing, Scott Daly
Sharp Laboratories of America, 5750 NW Pacific Rim Blvd., Camas, WA, USA 98607

ABSTRACT

A Visual Model (VM) is used to aid in the design of an Ultra-high Definition (UHD) upscaling algorithm that renders High Definition legacy content on a UHD display. The costly development of such algorithms is due, in part, to the time spent subjectively evaluating the adjustment of algorithm structural variations and parameters. The VM provides an image map that gives feedback to the design engineer about visual differences between algorithm variations, or about whether a costly algorithm improvement will be visible at expected viewing distances. Such visual feedback reduces the need for subjective evaluation. This paper presents the results of experimentally verifying the VM against subjective tests of visibility improvement versus viewing distance for three upscaling algorithms. Observers evaluated image differences for upscaled versions of high-resolution stills and HD (Blu-ray) images, viewing a reference and a test image, and controlled a linear blending weight to determine the image discrimination threshold. The thresholds varied with viewing distance as expected, with larger amounts of the test image required at farther distances. We verify the VM by comparing predicted discrimination thresholds against the subjective data. After verification, VM visible difference maps are presented to illustrate the practical use of the VM during design.

Keywords: upsampling, upscaling, resizing, image interpolation, visual model, image differences

1. INTRODUCTION

Emerging Ultra-high Definition (UHD) displays (e.g., 4096 x 2160 panels) require well-tuned, computationally expensive, adaptive upscaling algorithms in order to render high-quality imagery from legacy High Definition (HD) source material. High-quality upscaling is needed because these displays are expensive and customer expectations will therefore be high. Expected use scenarios, such as home theatre, radiography, scientific visualization, and military situation displays, provide excellent viewing conditions (darkened viewing rooms, closer-than-usual viewing distances) in which algorithm processing artifacts may be quite visible.

The trade-off between upscaling algorithm expense and quality is usually made by means of time-consuming informal subjective tests in which a number of expert viewers gather to rate aspects of visual quality such as edge smoothness, texture sharpness, noise suppression, and the visibility of artifacts introduced by the algorithm itself. As candidate algorithms and variations thereof are explored, a substantial cost accrues in both development time and time spent in evaluation meetings, since each variation typically affects several aspects of visual quality at once. The use of a Visual Model (VM) can substantially reduce this turn-around time for selecting attractive algorithm options for the target UHD display. The VM provides the researcher with a map of visible differences between an algorithm variation and a suitably chosen reference, or between candidate algorithms. For example, an inexpensive algorithm and an expensive one might only begin to exhibit visible differences at a viewing distance that is much closer than the intended observer viewing distance for the display. In this case the VM can quickly deliver a verdict.

This paper presents results from experiments that validate a VM by testing the correlation of the VM output against formal subjective data involving three upscaling methods: an advanced adaptive algorithm, a computationally expensive traditional algorithm, and a computationally inexpensive traditional algorithm. In the course of validating the VM as a function of viewing distance, example difference maps are presented. Ongoing work continues to validate the VM for other aspects of upscaling quality.

The paper is organized as follows. In Section 2 we briefly discuss some of the many considerations in a modern upscaling algorithm and present the upscaling framework used in our verification process. In Section 3 we review relevant prior work on image quality attributes and visual models. We then describe our subjective method and results in Sections 4 and 5. In Section 6 we describe the application of a visual model to the problem and its predictions for our subjective experiments. An appendix contains a description of the visual model.

2. SPATIAL UPSCALING

2.1 Trade-offs

Spatial upscaling of images/video is the process of increasing the number of samples used to represent an image or video frame. Modern upscalers additionally attempt to enhance perceived sharpness and edge smoothness while reducing noise and avoiding algorithmic artifacts related to edge and texture classification errors. State-of-the-art, proprietary (and expensive) upscaling schemes are needed to meet these demands. Non-linear schemes that adapt to image content have the potential to preserve object boundaries rather than blurring or corrupting them as classical linear upscalers do. But, as with classical upscalers, adaptive ones tend to blur textures. One approach is to add an enhancement stage to further increase image detail beyond the basic upscaling method. Sharpness enhancement, however, can lead to boosted film grain or compression noise, requiring the inclusion of a nonlinear noise-reduction stage. Figure 1 illustrates possible orderings of such processing blocks.

Figure 1. Possible upscaler variations, from HD input frame to 4K output frame: (a) upscaling, noise control, enhancement; (b) noise control, upscaling, enhancement; (c) noise control, enhancement, upscaling.

Module ordering can have several impacts. First, computational load will vary, since the number of pixels processed depends on the underlying resolution at each stage. In the conversion of algorithms to hardware, the computational and buffer requirements may lead to design trade-offs such as coarser approximations to a division operation or fewer line buffers. Second, reordering can have subtle visual effects, since preceding modules affect the behavior of subsequent modules. For example, preceding an edge-adaptive upscaler with enhancement may help the upscaler lock onto edges more accurately, but reconstructed high-frequency detail may go unenhanced.

2.2 Spatial upscaling framework

Two of the upscaling algorithms we used in our studies were global (spatially non-adaptive). Results from the third were generated using a unified two-channel algorithm framework, depicted in Figure 2. For this study the framework was restricted to two channels, texture and edges+flat regions (sometimes referred to as the "cartoon structure"), although in general the framework has more channels. The upper part of the figure shows the highest-level functional blocks; the lower part, a slightly more detailed view. The correspondence between upper and lower parts is indicated by the large arrows. Examples of the decomposer channel output are shown in Figure 3. The non-linear operation of the decomposition is easily seen in that most of the texture has been removed from the edge/flat-region channel without affecting edge structure. Each of the two channels from the decomposition is processed independently by texture-adaptive and edge-adaptive interpolators, respectively. A third output of the decomposer is "side information" consisting of a confidence measure, which is used to form a blending map for recombining the filtered channel results into a single image (a sketch of such a recombination follows below). A primary goal of our work is to provide design engineers with a tool that may be used to aid in adjusting the numerous parameters of the decomposer, filtering, and recombination steps. One approach is to maintain a desired output while making algorithm modifications.
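To make the recombination step concrete, the following is a minimal sketch. It assumes the decomposer's confidence map is used directly as a per-pixel α-map that weights the two upscaled channel results; the actual α-map formation and recombination rule of the framework are not specified here, so the function name and the direct use of the confidence map are illustrative assumptions.

```python
import numpy as np

def recombine_channels(edge_up, texture_up, confidence):
    """Hypothetical confidence-based alpha-blending of the two upscaled channels.

    edge_up    : edge/flat-region channel after edge-adaptive interpolation
    texture_up : texture channel after training-based interpolation
    confidence : decomposer "side information" in [0, 1]; assumed here to act
                 directly as the per-pixel alpha map
    """
    alpha = np.clip(confidence, 0.0, 1.0)
    return alpha * edge_up + (1.0 - alpha) * texture_up
```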


Figure 2. Multi-channel adaptive upscaler (two channels shown). A 1920x1080 (Full HD) input undergoes a two-channel non-linear spatial decomposition that separates the texture information (film grain, text and graphics, edges) from the edges + flat regions channel. Channel-specific upconversion applies training-based texture interpolation to the texture channel and edge-adaptive interpolation to the edge channel; "side information" (not a channel) drives α-map formation, and confidence-based α-blending recombines the channels into the 4096x2160 output.


Figure 3. Example decomposition output: a) input, b) edge/flat-region channel, and c) texture channel.

2.3 Filters used during VM verification

Three increasingly higher-quality (and more costly) upscaling algorithms were used in the process of verifying the VM. The lowest-quality upscaler was the standard benchmark: bilinear interpolation. The interpolator was realized within our framework by adjusting the decomposition to pass the entire input image into the texture channel, and nothing into the edge channel. The texture filter was then defined by a 3x3-tap bilinear kernel implemented in a separable manner. The middle-quality algorithm used the same framework configuration as the bilinear interpolator, except that a 17x17-tap modified Lanczos kernel was substituted for the bilinear kernel, again implemented in a separable fashion. This filter's transition bands were modified to better preserve textures at the expense of slightly more edge ringing. The filter responses are compared in Figure 4. The highest-quality and most complex filter used the full two-channel framework: the cartoon structure of the image was upscaled using a directional interpolator and the texture regions were scaled using a training-based algorithm.


Figure 4. Frequency responses (magnitude spectra) of the 1D bilinear and modified Lanczos interpolation kernels, plotted from 0 to 0.5 cycles/sample.
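A comparison like Figure 4 can be approximated with a short sketch. The paper's 17-tap modified Lanczos kernel is proprietary (its transition bands were retuned), so the sketch substitutes a standard Lanczos-4 kernel of the same length alongside the 3-tap bilinear kernel; both are normalized to unity DC gain and evaluated over 0-0.5 cycles/sample.

```python
import numpy as np

def bilinear_kernel():
    """3-tap bilinear (triangle) kernel for 2X upsampling."""
    return np.array([0.5, 1.0, 0.5])

def lanczos_kernel(a=4, scale=2):
    """Standard Lanczos-a kernel sampled at half-pel positions for 2X upsampling;
    a=4 gives 17 taps, the same length as the paper's modified Lanczos
    (whose exact transition-band changes are not given here)."""
    x = np.arange(-a * scale, a * scale + 1) / scale
    return np.sinc(x) * np.sinc(x / a)

def magnitude_response(kernel, n=1024):
    """|H(f)| over 0-0.5 cycles/sample, with the kernel normalized to unity DC gain."""
    k = np.asarray(kernel, dtype=float)
    k = k / k.sum()
    f = np.fft.rfftfreq(n)                 # 0 ... 0.5 cycles/sample
    return f, np.abs(np.fft.rfft(k, n))

f, h_bi = magnitude_response(bilinear_kernel())
_, h_lz = magnitude_response(lanczos_kernel())
```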

3. SUBJECTIVE APPROACHES

The effects of resolution, image size, and viewing distance on picture quality [1][2] are important for understanding the relative quality of 4K, 2K, and non-HD displays and at what viewing distances differences are discriminable. One class of display quality metric that predicts many of the main effects is based on the display MTF, using the contrast sensitivity function to weight differences in MTF by frequency (e.g., Barten's SQRI [2]). For upsampling evaluation, a first approach could be to formulate a display-plus-algorithm transfer function and apply a metric based on this system MTF. However, adaptive algorithms do not have a simple transfer function, and non-linear effects such as visual masking may be difficult to incorporate.

An alternative approach is based on image discrimination models that are designed to predict visibility of differences for general image content [3]-[8]. Success for the case of upsampling algorithms may be expected given applications to similar problem domains: compression [9][10], aliasing [11], computer graphic rendering [12], defect detection [13], target identification [14], and optical acuity prediction [15]. In the majority of this work, channel-based models have outperformed single-channel, CSF-based models, and models incorporating visual masking have made better predictions for differences that depend on the image background. Specifically toward the design of upscaling algorithms, several papers have used the S-CIELAB visual model [7] to evaluate results. Vrhel compared several resolution conversion algorithms, primarily varying the underlying colorspace [16]. For demosaicing, S-CIELAB has also gained in popularity [17], where a demosaiced result is typically compared to a full-resolution groundtruth image (cf. [18]). Our efforts continue along these lines, aiming to evaluate the degree of agreement between a visual model and subjective results.

A third approach toward evaluating upscaling performance is to develop and utilize metrics for individual artifact dimensions and to weight them to predict overall quality [19]. Typical upsampling artifacts comprise blur, jagged edges, edge/line width or contour distortions, noise amplification, ringing, and visible transitions due to spatially-varying filter parameters. Some of these attributes have been studied individually. One example is the finding that blur discrimination thresholds exhibit a "dipper" shape, in which discrimination thresholds for blurs slightly above threshold are lower than the absolute blur threshold [20][21]. Blur metrics have also been proposed [25]-[28]. Kayargadde and Martens studied interactions between blur and noise, evaluating approximately Gaussian blur combined with varying levels of additive Gaussian noise. Observer dissimilarity ratings and categorical scalings of blur, noise, and quality were modeled using multidimensional scaling, and then related to a noise metric (standard deviation in uniform areas) and a blur metric (transition width at edges). Johnson and Fairchild conducted a large paired-comparison study in which observers selected the sharper image across resolution, contrast, noise, and sharpening manipulations. They evaluated how well several visual difference models could predict the derived sharpness rankings. Zhang, Allebach, and Pizlo conducted sharpness threshold and preference studies by varying unsharp masking parameters.
Metric quantities were computed at sharp edges in the images, including: 1) average absolute difference in a 5x5 window, 2) average edge transition width, and 3) average edge transition slope. Reasonably good correlation with subjective blur was reported. Other quality attributes varying across upsampling algorithms involve edge and line distortions. Naiman found that jaggedness detection varied with edge slope and determined thresholds in terms of sampling rate in pixels/deg [22].


Wu and Dalal [23] recently evaluated line/edge waviness, width variations, and raggedness. Sensitivity to all three manipulations had the same basic shape, peaking at approximately 5 cpd, with highest sensitivity to waviness and slightly lower sensitivity to width variation and raggedness. Hamerly and Springer [24] found a function of the same basic shape but peaking at ~10 cpd. Wu and Dalal defined metrics based on the line/edge profiles and related their values to overall line/edge quality. Specifically on the evaluation of upsampling algorithms, Vicario et al. [29] described subjective studies for mixtures of bilinear and nearest-neighbor downsampling/upsampling for two resolution scale factors. Expert observers categorically rated several types of distortions and naïve observers indicated image preference. Interval quality scores were derived via Thurstonian scaling analysis and were related to the categorical ratings by regression. The result was a procedure involving expert-observer scaling of artifact magnitudes which could be used to predict upsampled image quality for end-users. A well-tuned visual model has the potential of playing the same role as the expert observers in their approach.

In our studies we did not ask observers to rate or scale individual attributes but rather asked them to identify when they could see a difference between algorithm outputs. This eliminates difficulties with labeling specific quality attributes or with determining the correct weightings between metrics for different attributes. This approach is similar to other image discrimination work [13][14] and extends the approach to upsampling artifact detection.

4. SUBJECTIVE METHOD

Four observers experienced in upsampling artifact evaluation viewed images on an Aquos GP1U LCD TV at three viewing distances. Observers were color normal and had corrected vision of 20/20 or better. Viewing distances are reported in picture heights for a 4096x2160 4K display. The Nyquist frequencies corresponding to each viewing distance were 19 cpd (1H), 38 cpd (2H), and 57 cpd (3H). The display color gamut and luminance response were approximately equivalent to an sRGB display, but with a maximum luminance of 327 cd/m². A color characterization was determined and used to convert displayed images to absolute CIE XYZ for processing by the visual model.
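The quoted Nyquist frequencies follow from the 2160-line picture height and a small-angle approximation for the visual angle of one pixel; the short check below reproduces the 19/38/57 cpd values.

```python
import math

def nyquist_cpd(picture_heights, lines_per_picture_height=2160):
    """Nyquist frequency in cycles/deg at a viewing distance expressed in
    picture heights, using a small-angle approximation per pixel."""
    pixels_per_degree = lines_per_picture_height * picture_heights * math.pi / 180.0
    return pixels_per_degree / 2.0

for h in (1, 2, 3):
    print(f"{h}H: {nyquist_cpd(h):.0f} cpd")   # -> 19, 38, 57 cpd
```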

The subjective task used the method of adjustment to determine image discrimination thresholds. A pair of images was presented simultaneously. The left image was always the reference image while the right image was under observer control. Observers evaluated the six pairings listed in Table 1. As described above, the selected algorithms include bilinear (BI) and Lanczos (LZ) interpolation and a directional training-based approach (DITB).

Table 1. Image comparison conditions.

Condition   Reference   Test
1           GT          BI
2           GT          DITB
3           GT          LZ
4           BI          DITB
5           BI          LZ
6           DITB        LZ

The pairs of images were either comparisons between a groundtruth image (GT) and an upsampled image, or comparisons between upsampling algorithms. The test image was generated by linearly blending the reference and test images, with the observer controlling the weight w in the range [0, 1] according to eqn. (1). At a weight w = 0 the adjusted image was maximally different from the reference image, while at w = 1 the adjusted and reference images were identical. An example of the blending method is shown in Figure 5.

I_adjusted = w · I_ref + (1 − w) · I_test        (1)

Observers were instructed to adjust the image to the point where it was just discriminable from the reference. Observers modified the weight in real time via keypresses, using three step sizes for weight increments: 10%, 5%, and 1%. Upon reaching a satisfactory setting, the observer pressed the return key. If the image pair was not discriminable, even at maximal difference, the observer pressed the escape key. This approach helps the observer find the easiest areas for comparison; other approaches such as TAFC provide no such help. Since users of 4K display products are generally expert viewers, experts can be expected to eventually find the difference "hotspots".
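A minimal sketch of the adjustment stimulus: eqn. (1) as a function, plus the clamped weight update. The key-handling details of the real experiment are not reproduced, and the step-size labels are illustrative.

```python
import numpy as np

def blended_image(ref, test, w):
    """Eqn. (1): w = 1 reproduces the reference, w = 0 shows the full test image."""
    return w * ref + (1.0 - w) * test

STEP_SIZES = {"coarse": 0.10, "medium": 0.05, "fine": 0.01}   # 10%, 5%, 1%

def step_weight(w, step_name, direction):
    """Nudge the blending weight up (+1) or down (-1) by one step, clamped to [0, 1]."""
    return float(np.clip(w + direction * STEP_SIZES[step_name], 0.0, 1.0))
```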


Source images were either high-resolution stills (5 images) or frames from compressed HD (Blu-ray) sequences (3 images). The groundtruth images were cropped to 900x900, and the original Blu-ray images to 450x450. The groundtruth still images were filtered and downsampled 2X using an anti-aliasing filter designed to preserve sharpness as much as possible. The downsampled images were then upsampled using bilinear (BI), Lanczos (LZ), or the directionally-adaptive method (DITB). The source HD images were never downsampled, but were similarly upsampled 2X by the same methods. For the high-resolution stills, observers evaluated conditions 1-6; for the HD source images, conditions 4-6. The Blu-ray images employed in the experiment consisted of two facial close-ups and a graphically rendered scene, shown in Figure 6. These images were of considerably lower sharpness than the high-resolution groundtruth still images.
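A sketch of the still-image preparation, under stated assumptions: Pillow's LANCZOS resampling stands in for the custom sharpness-preserving anti-aliasing filter (which is not specified), and its BILINEAR/LANCZOS modes stand in for the BI and LZ upscalers; the proprietary DITB method is omitted.

```python
from PIL import Image

def prepare_still(ground_truth_path):
    """2X anti-aliased downsample of a groundtruth crop, then 2X upsampling
    by two stand-in methods (BI and an approximation of LZ)."""
    gt = Image.open(ground_truth_path).convert("RGB")
    w, h = gt.size
    low = gt.resize((w // 2, h // 2), Image.LANCZOS)     # stand-in anti-aliased downsample
    up_bi = low.resize((w, h), Image.BILINEAR)           # BI condition
    up_lz = low.resize((w, h), Image.LANCZOS)            # stand-in for the LZ condition
    return gt, up_bi, up_lz
```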

Figure 5. Example of blended images for a high-resolution still, between groundtruth and the 2X bilinear-upsampled result: groundtruth, 20% bilinear, 75% bilinear, and 100% bilinear.

Figure 6. Images used in the experiment: top row, high resolution stills; bottom row, HD Blu-ray.


5. SUBJECTIVE RESULTS

For most image discriminations, observers were able to set blending weights such that the adjusted image was just discriminable from the reference. The task difficulty varied with comparison condition, viewing distance, and image. In some cases, particularly at the furthest viewing distance, observers could not distinguish the full-weight test image from the reference. This occurred most often for the DITB vs. LZ comparison and for the Blu-ray images at long viewing distances. Averaging across thresholds is problematic when threshold could not be reached. Across all observers, however, satisfactory thresholds were set for all conditions. Rather than discarding values for images where a threshold could not be set, we opted to substitute a weight of zero. Our reasoning is that discarding these images and averaging only the valid thresholds gives the misleading interpretation that a given image was, on average, discriminable at the average threshold weight. Substituting zero for cases where threshold was not reached at least pulls the average toward the full-test image, and a weight of zero can be interpreted as not discriminable from the reference.

Threshold image discrimination blending weights are shown in Figures 7a) and b) across comparison conditions for the high-resolution still images. Points represent the mean blending weight for all images of the given comparison condition. Symbol and line types indicate viewing distance in 4K picture heights. The left panel shows the results for one observer and the right panel the averages across all observers. Error bars represent ±1 standard error of the mean. Figure 7c) summarizes the MSE across images for each condition. MSE is largest for the comparisons with groundtruth. The mean observer threshold weights do not show the same pronounced difference between the groundtruth and across-algorithm comparisons that is seen for MSE. Also, MSE has no ability to predict the dependence of visibility on viewing distance.

As viewing distance increased, discrimination became more difficult and a larger weighting of the test image was needed in order to reach discriminability. A more difficult discrimination corresponds to a lower threshold blending weight, where a larger percentage of the test image was required to reach threshold. Thus, threshold blending weights have higher values when sensitivity to the image difference was higher for a given condition.
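The zero-substitution rule described above amounts to the following small helper, assuming unreachable thresholds are recorded as None:

```python
import numpy as np

def mean_threshold_with_substitution(thresholds):
    """Mean and standard error of threshold weights for one condition,
    substituting 0.0 ('not discriminable even at full test weight') when an
    observer could not reach threshold."""
    w = np.array([0.0 if t is None else t for t in thresholds], dtype=float)
    return w.mean(), w.std(ddof=1) / np.sqrt(len(w))

# e.g. mean_threshold_with_substitution([0.62, 0.55, None, 0.71])
```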

Figure 7. a) Average threshold weights for one observer, and b) across observers, at 1H (black symbols), 2H (gray), and 3H (white); the vertical axis is sensitivity to the image difference (threshold weight), ranging from 100% test to 100% reference, for the conditions GT vs BI, GT vs DITB, GT vs LZ, BI vs DITB, BI vs LZ, and DITB vs LZ. c) Image MSE across comparisons.


Figure 8. Mean observer threshold weights for high-resolution stills (a) and HD Blu-ray images (b) at 1H (black), 2H (gray), and 3H (white), for the BI-DITB, BI-LZ, and DITB-LZ conditions. Error bars represent ±1 standard error of the mean. c) MSE for still and HD images across conditions.

Among the comparison conditions, observers were most sensitive to the difference between groundtruth and the upsampled images. Observers had similarly high sensitivity in the bilinear-Lanczos (BI-LZ) condition. The most difficult discrimination occurred for the DITB-LZ condition. In almost all cases, average threshold blending weights trended to lower values as viewing distance increased; one exception occurred for the DITB-LZ comparison.

Results for the upsampled HD images are shown in Figure 8b, compared to the mean observer results for the still images in Figure 8a. The groundtruth comparisons are omitted because they were not possible for the HD content. Bar color indicates the viewing distance: 1H (black), 2H (gray), and 3H (white). Figure 8c compares the MSE across images for the still and HD images. Observers were less sensitive to the image differences for HD content relative to the high-resolution stills, as indicated by the relative bar heights for corresponding viewing distances. Based on the large MSE difference between stills and HD, a much larger fall-off in discriminability would be expected if MSE were predictive. The difference between the DITB and LZ upsampling methods for HD content was visible at 1H, requiring almost the full test weight. The discrimination became virtually impossible at 2H and 3H, with many responses indicating that the algorithms were indistinguishable. Bilinear was clearly discriminable from the DITB and LZ methods at all viewing distances. Based on the similar MSE values for the HD images, all of the HD image comparisons would be expected to be near-indistinguishable.

6. VISUAL MODEL ANALYSIS

We applied the visual model described in the appendix to the same images seen by observers. Difference images were computed for each condition pair and each image, varying the blending weight between 0 and 1. Example difference images are shown for several image regions in Figure 9. The differences above a fixed threshold are highlighted in the third column. In some cases the highlighted differences correspond predominantly to blur, in others to jagged edges, to ringing, or to detail loss at highlights. The regions most significantly different between the pairs are well correlated with the model-predicted differences.

To evaluate the model against our subjective data requires collapsing the difference images to scalar values. We accomplish this as described in the appendix using Minkowski spatial pooling with an exponent α. Example model scalar responses are shown in Figure 11 for the five high-resolution still images at 1H for the bilinear-Lanczos condition, as a function of blending weight. Points represent the average observer threshold weights for the same comparisons. The horizontal line illustrates how a threshold may be applied to the model scalar response to generate threshold predictions for each image discrimination. We determined the threshold which minimized the SSE between the model and observer threshold blending weights. In Figure 10 we plot the SSE corresponding to the best scalar threshold as a function of the spatial pooling parameter α; the minimum SSE occurred for α ≈ 2. The resulting model predictions across viewing distances and comparisons are shown in Figure 12a, where points represent the mean thresholds averaged across images. Figure 12b presents a direct comparison between the model predictions and average observer threshold weights. Points represent the image discriminability for the model and observer, with higher weights corresponding to cases where sensitivity to the particular image discrimination was higher. Perfect agreement between model and mean observer would have all points fall on the diagonal. Error bars for the model and observer data are standard errors across images. The model approximately agrees with the observer thresholds with regard to how discriminability varies with viewing distance and comparison condition. The largest discrepancy occurs for the DITB-LZ comparison at 1H (the point labeled "6" well above the diagonal), where the model over-predicts visibility of the image difference. For the Modelfest threshold dataset [30][8], a pooling parameter of α = 3.5 provided the best fit. This discrepancy between spatial pooling parameters may reflect differences between a fixated detection task and an image discrimination task that involves search.
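The scalar summary and threshold prediction described above can be sketched as follows; the function names and the search over the criterion are illustrative assumptions about implementation details, not the authors' exact code.

```python
import numpy as np

def pooled_response(diff_map, alpha=2.0):
    """Minkowski spatial pooling of a visible-difference map to a scalar
    (the best fit reported above used alpha of about 2)."""
    d = np.abs(np.asarray(diff_map, dtype=float))
    return float((d ** alpha).sum() ** (1.0 / alpha))

def predicted_threshold_weight(weights, responses, criterion):
    """Return the largest blending weight whose scalar response exceeds the
    criterion, or 0.0 if even the full test image (w = 0) stays below it."""
    for w, r in sorted(zip(weights, responses), reverse=True):
        if r > criterion:
            return w
    return 0.0

def sse(model_thresholds, observer_thresholds):
    """Sum of squared errors between model-predicted and mean observer thresholds;
    minimizing this over the criterion (and alpha) gives the reported fit."""
    m = np.asarray(model_thresholds, dtype=float)
    o = np.asarray(observer_thresholds, dtype=float)
    return float(((m - o) ** 2).sum())
```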

Figure 9. Example visible difference maps for four image difference conditions. The left image is the reference; center is the maximally-different test image; and right is the difference map. Labels above each image indicate the reference, test, and viewing distance. Highlighting in red is used for comparisons a)-c), and highlighting in yellow for comparison d). Image regions are 150 x 225 pixels.


Figure 10. SSE between model-predicted and mean observer threshold blending weights, as a function of the spatial pooling parameter α.

Figure 11. Scalar model response at 1H as a function of blending weight for the bilinear-Lanczos condition. Each line represents the response for a single image. Points represent mean observer thresholds for this condition and each image. The dashed horizontal line indicates a candidate threshold for determining the model-predicted threshold.

Figure 12. Visual model predicted image discrimination thresholds for condition pairs. a) Predictions across conditions for 1H, 2H, and 3H. b) Comparison of mean model predictions and observer threshold blending weights. Numbering indicates the condition numbers listed in Table 1.

7. SUMMARY AND CONCLUSIONS

We have presented our initial efforts at predicting differences between upsampling algorithms using a general-purpose visual model. The model does not distinguish between categories of artifacts but rather computes a map of the visible differences between images as well as an overall image difference scale value. We presented image discrimination experiments in which observers determined threshold differences between groundtruth and three upsampling methods, or between the three upsampling methods. The subjective data demonstrate the expected result that image discrimination becomes more difficult as viewing distance increases. The subjective data also quantify the specific algorithm differences as a function of distance, which is useful for final upsampling algorithm selection, allowing computational complexity to be weighed against resultant quality. For the studied high-resolution stills, all three algorithms were distinguishable at 3H. For the HD Blu-ray content, the more advanced algorithm was not discriminable at 3H.


We then used the image discrimination methodology to evaluate how well a multi-channel visual model could predict the threshold differences. The visible difference maps output by the model are in qualitative agreement with the subjectively visible differences across algorithms and as a function of viewing distance. This supports use of the difference map by algorithm developers when evaluating large sets of images, or when evaluating algorithm differences as a function of viewing distance. We described deriving a scalar model response from the difference map using a spatial pooling technique common in the literature. This scalar summary measure approximately tracked the observer data. We determined the spatial pooling parameter, α, and the corresponding scalar threshold which yielded the best fit to the observer data. The resultant pooling parameter was slightly different from the fit of the same parameter to the Modelfest threshold dataset. Overall, we are encouraged by the degree of correspondence between the general-purpose visual model and our subjective data. Further work will aim to resolve the discrepancies in the model predictions, to better understand the relation between image difference models and separate-attribute quality metrics, and to apply the approach to a broader range of upsampling-enhancement configurations.

8. APPENDIX: VISUAL MODEL DESCRIPTION

We extended a visual model described previously [31][32], which is based on Daly's VDP [4]. The model incorporates many elements common to other contemporary visual models: optical filtering, cone transduction, a luminance nonlinearity, opponent color conversion, neural CSF filtering, a modified cortex channel decomposition, within-channel masking, and spatial pooling. In our application here, the front-end RGB processing is applied but only the luminance channel differences were computed, since the upsampling algorithms differed purely in terms of luminance processing. Our previous work incorporated splitting the CSF into optical and neural components based on Barten's [2] optical and CSF models. We report modifications to the VM front-end which make the decomposition more consistent with estimates of the eye's optical quality. The steps involved are: 1) compute the overall CSF, 2) estimate pupil size, 3) approximate the monochromatic aberrations of the eye, 4) approximate the axial chromatic aberrations given our display spectra, 5) combine axial and monochromatic aberrations, and 6) factor the CSF into optical and neural components.

Numerous measurements of the eye's optics have been made using double-pass, interferometric, and Hartmann-Shack wavefront sensing [33]. Several studies have characterized the MTF as a function of pupil size, an important factor in applying a VM across varying light levels. In general, as pupil size increases, MTF bandwidth decreases as higher-order aberrations come to dominate over diffraction effects. The data of Guirao et al. [34], Deeley et al. [35], and Rovamo et al. [36] describe MTF across pupil size and are in fairly good agreement. Of these, we selected Guirao et al.'s measurements since they encompass both age (3 brackets) and light-level effects (3, 4, and 6 mm pupils). We use Le Grand's model for predicting pupil size given an adapting luminance level:

d = 5 − 3 tanh(0.4 log L)        (2)
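For reference, eqn. (2) evaluated in code (with the log taken as base 10, the usual convention for Le Grand's formula):

```python
import math

def pupil_diameter_mm(adapting_luminance_cd_m2):
    """Le Grand's pupil model, eqn. (2): diameter in mm for luminance L in cd/m^2."""
    return 5.0 - 3.0 * math.tanh(0.4 * math.log10(adapting_luminance_cd_m2))

# e.g. at the display's 327 cd/m^2 maximum this gives ~2.7 mm; dimmer
# adaptation levels of 10-100 cd/m^2 give roughly 3-4 mm.
```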

The average monochromatic aberration is captured by the Guirao et al. data. We determine this MTF by interpolating coefficients a (Table 2) and b (Table 3) based on the estimated pupil size, d:

MTF(f, age, d) = 0.25 · (3·e^(−f/a) + e^(−f/b))        (3)

Longitudinal chromatic aberration (LCA) results because the eye is in focus at only one wavelength, approximately 580 nm. The magnitude of LCA is fairly stable across individuals because it is due to the material properties of the eye's components. Transverse chromatic aberration results in a lateral shift across wavelengths and is more variable across individuals, but it can be assumed negligible for a pupil aligned with the optical axis. Because display spectra are becoming increasingly narrowband (for power efficiency and to achieve larger color gamuts), LCA may become more relevant for display quality evaluation. We employ Marimont and Wandell's [37] LCA model, which describes the OTF as a function of wavelength relative to the diffraction limit. The effect of monochromatic aberrations is applied separately, and for this step we use the Guirao et al. data. The Guirao et al. MTFs are the combination of diffraction and monochromatic aberrations. We compute the coherent diffraction limit for the wavelength used by Guirao et al. in their double-pass experiments (543 nm) and factor the overall MTF into a monochromatic aberration component and a diffraction component.


Table 2. Coefficients for parameter a for the MTF as a function of pupil size and age.

Age group    Pupil size (mm)
             3        4        6
20-30        16.12    10.52    5.89
40-50        10.46    7.15     4.48
60-70        5.81     4.72     3.34

Table 3. Coefficients for parameter b for the MTF as a function of pupil size and age.

Age group    Pupil size (mm)
             3        4        6
20-30        17.5     21.3     19.26
40-50        27.26    23.58    16.01
60-70        21.68    19.31    14.14
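A sketch of eqn. (3) using the tabulated coefficients as reconstructed above; linear interpolation over pupil diameter is an assumption, since the text states only that the coefficients are interpolated from the estimated pupil size.

```python
import numpy as np

PUPILS_MM = np.array([3.0, 4.0, 6.0])
A_COEF = {"20-30": [16.12, 10.52, 5.89],   # Table 2 (parameter a)
          "40-50": [10.46, 7.15, 4.48],
          "60-70": [5.81, 4.72, 3.34]}
B_COEF = {"20-30": [17.5, 21.3, 19.26],    # Table 3 (parameter b)
          "40-50": [27.26, 23.58, 16.01],
          "60-70": [21.68, 19.31, 14.14]}

def guirao_mtf(freq_cpd, pupil_mm, age_group="20-30"):
    """Eqn. (3): MTF(f) = 0.25 * (3*exp(-f/a) + exp(-f/b)), with a and b
    interpolated (linearly, by assumption) over pupil diameter."""
    a = np.interp(pupil_mm, PUPILS_MM, A_COEF[age_group])
    b = np.interp(pupil_mm, PUPILS_MM, B_COEF[age_group])
    f = np.asarray(freq_cpd, dtype=float)      # spatial frequency in cpd
    return 0.25 * (3.0 * np.exp(-f / a) + np.exp(-f / b))

# e.g. guirao_mtf(np.arange(0, 31, 5), pupil_mm=2.7)   # pupil size from eqn. (2)
```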

The polychromatic OTFs can be computed for given display primary spectra using the LCA-plus-monochromatic-aberration model. The overall OTF is a weighted sum of the monochromatic OTF at each wavelength, with the weighting based on the spectral luminosity and radiance at each wavelength. The OTFs resulting for our display primaries are shown in Figure 13b. The curve with negative OTF values corresponds to the OTF at the peak wavelength of the blue primary. As expected, attenuation is higher for the shorter wavelengths, while the OTFs for the red and green primaries are very similar. We approximate the OTF of a luminance channel by weighting the primaries' OTFs according to their luminance contributions. This estimated luminance OTF is used when factoring the CSF into optical and neural components.

For the overall CSF we use Daly's model [4], which predicts contrast sensitivity as a function of light level, eccentricity, image area, and accommodation, and includes an oblique effect. As shown in Figure 14, the model is in reasonable agreement with the Modelfest fixed-size Gabor thresholds for the Modelfest viewing conditions. To determine the luminance-channel neural filter we divide the CSF by the estimated luminance MTF described above, with the result shown in Figure 14.

The cortex transform [38] is modified as described in [4], using dom filters to define 6 spatial bands and fan filters to define 6 orientation channels. We compute both even and odd channel responses. Visual masking is modeled as a spatially-varying threshold elevation factor which is applied separately to each channel. For the channel response we divide the even response by the threshold elevation factor. When no masking occurs, the elevation value t equals unity; as masking increases, t becomes greater than 1. The threshold elevation value is based on the response summed across the even and odd responses; this use of even and odd responses makes the masking signal insensitive to phase. Basing the channel response on only the even channel filter leads to better localization of the response signal. We use separate elevation maps for the test and reference images:

r' = r_even / t,   where   t = [ (r_even² + r_odd²)^(1/2) ]^0.7

Differences are computed between the masked reference and test responses for each channel. A difference image is generated by pooling differences across the n channels using Minkowski summation:

D(x, y) = [ Σ_i |r'_ref(x, y) − r'_test(x, y)|^β ]^(1/β)

A scalar model response is generated by spatial pooling of the difference values:

R = [ Σ_{x,y} D(x, y)^α ]^(1/α)

A threshold blending weight is determined by finding the weight which leads to a scalar response that exceeds a fixed threshold. A β value of 3 was used. For β = 3, an α value of 3.5 minimized the RMSE to the Modelfest threshold dataset.
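A direct transcription of the three equations above, assuming the even/odd channel responses from the cortex transform are already available; the small epsilon guard is a numerical convenience, not part of the model.

```python
import numpy as np

def masked_response(r_even, r_odd, eps=1e-12):
    """Within-channel masking: r' = r_even / t, with the phase-insensitive
    threshold elevation t = [ (r_even^2 + r_odd^2)^(1/2) ]^0.7."""
    t = np.sqrt(r_even ** 2 + r_odd ** 2) ** 0.7
    return r_even / (t + eps)

def difference_map(ref_channels, test_channels, beta=3.0):
    """D(x, y): Minkowski summation of masked-response differences across the channels."""
    acc = sum(np.abs(r - s) ** beta for r, s in zip(ref_channels, test_channels))
    return acc ** (1.0 / beta)

def scalar_response(D, alpha=3.5):
    """R: Minkowski spatial pooling of the difference map to a single number."""
    return float((np.asarray(D, dtype=float) ** alpha).sum() ** (1.0 / alpha))
```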


Figure 13. a) Aquos GP1U RGB spectra (radiance in W/sr/m² versus wavelength, 400-750 nm). b) Optical MTF estimated for each display primary over 0-30 cpd, by combining Marimont & Wandell's (1994) axial chromatic aberration model and monochromatic aberration estimates from Guirao et al. (1999); the red and green curves nearly coincide, the blue primary (shown at its peak wavelength) dips below zero, and the incoherent diffraction limit at 580 nm is shown for reference. Pupil size was 3 mm.

Figure 14. Conventional CSF (solid line) factored into optical and neural components (dashed), plotted over 0.5-50 cpd. Modelfest fixed-size Gabor thresholds are shown as open circles, shifted vertically by a scale factor of 1.5.

REFERENCES

[1] Westerink, J. H. D. M., "Perceived sharpness in static and moving pictures," PhD Thesis, Eindhoven University of Technology (1991).
[2] Barten, P. G. J., "Contrast Sensitivity of the Human Eye and Its Effects on Image Quality," SPIE Press, Bellingham, WA (1999).
[3] Watson, A. B., "The cortex transform: rapid computation of simulated neural images," Comp. Vis. Gr. Im. Proc., 39, 311-327 (1987).
[4] Daly, S., "The visible differences predictor: an algorithm for the assessment of image fidelity," in Digital Images and Human Vision, pp. 179-206, MIT Press (1993).
[5] Lubin, J., "The use of psychophysical data and models in the analysis of display system performance," in A. B. Watson (Ed.), Digital Images and Human Vision, pp. 163-178, MIT Press, Cambridge, MA (1993).
[6] Ahumada, A. J., Jr., "Computational image quality metrics: A review," SID Digest, vol. 24, pp. 305-308 (1993).
[7] Zhang, X. and Wandell, B. A., "A spatial extension of CIELAB for digital color image reproduction," SID Digest, vol. 27, pp. 731-734 (1996).
[8] Watson, A. B. and Ahumada, A. J., Jr., "A standard model for foveal detection of spatial contrast," Journal of Vision, 5(9) (2005).
[9] Watson, A. B., "DCTune: A technique for visual optimization of DCT quantization matrices for individual images," Society for Information Display Digest of Technical Papers XXIV, 946-949 (1993).


[10] Zeng, W., Daly, S., and Lei, S., "An overview of the visual optimization tools in JPEG 2000," Signal Processing: Image Communication, 17(1), 85-104 (2002).
[11] Daly, S., "The role of the visual system's orientation mechanisms in the perception of spatial aliasing," Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Vol. 2, pp. 1383-1387 (2003).
[12] Ferwerda, J. A., Pattanaik, S., Shirley, P., and Greenberg, D. P., "A model of visual masking for computer graphics," Proceedings SIGGRAPH, 143-152 (1997).
[13] Beard, B. L. and Ahumada, A. J., Jr., "Using vision modeling to define occupational vision standards," Human Factors and Ergonomics Society Annual Meeting, October 15, Denver, CO (2003).
[14] Rohaly, A., Ahumada, A. J., Jr., and Watson, A. B., "Object detection in natural backgrounds predicted by discrimination performance and models," Vision Research, 37, 3225-3235 (1997).
[15] Watson, A. B. and Ahumada, A. J., Jr., "Predicting visual acuity from wavefront aberrations," Journal of Vision, 8(4) (2008).
[16] Vrhel, M., "Color image resolution conversion," IEEE Trans. Image Proc., 14(3), 328-333 (2005).
[17] Li, X., Gunturk, B., and Zhang, L., "Image demosaicing: a systematic survey," Proc. SPIE, Visual Communications and Image Processing, vol. 6822 (2008).
[18] Longere, P., Zhang, X., Delahunt, P. B., and Brainard, D. H., "Perceptual assessment of demosaicing algorithm performance," Proc. IEEE, 90(1), 123-132 (2002).
[19] Ahumada, A. J., Jr. and Null, C. E., "Image quality: A multidimensional problem," in Digital Images and Human Vision, A. B. Watson, ed., pp. 141-148, MIT Press, Cambridge, MA (1993).
[20] Watt, R. J. and Morgan, M. J., "The recognition and representation of edge blur: evidence for spatial primitives in human vision," Vision Res., 23, 1465-1477 (1983).
[21] Wuerger, S. M., Owens, H., and Westland, S., "Blur tolerance for luminance and chromatic stimuli," J. Opt. Soc. Am. A, 18, 1231-1239 (2001).
[22] Naiman, A., "Jagged edges: when is filtering needed?," ACM Trans. on Graphics, 17(4), 238-258 (1998).
[23] Wu, W. and Dalal, E. N., "Perception-based line quality measurement," Proc. SPIE 5668, 111-122 (2005).
[24] Hamerly, J. R. and Springer, R. M., "Raggedness of edges," J. Opt. Soc. Am., 71(3), 285-288 (1981).
[25] Kayargadde, V. and Martens, J. B., "Perceptual characterization of images degraded by blur and noise: model," J. Opt. Soc. Am. A, 13(6), 1178-1188 (1996).
[26] Kayargadde, V. and Martens, J. B., "Perceptual characterization of images degraded by blur and noise: experiments," J. Opt. Soc. Am. A, 13(6), 1166-1177 (1996).
[27] Johnson, G. M. and Fairchild, M. D., "Sharpness Rules," IS&T/SID 8th Color Imaging Conference, Scottsdale, 24-30 (2000).
[28] Zhang, B., Allebach, J. P., and Pizlo, Z., "Investigation of perceived sharpness and sharpness metrics," Proc. SPIE 5668, 98-110 (2005).
[29] Vicario, E., Heynderickx, I., Ferretti, G., and Carrai, P., "Design of a tool to benchmark scaling algorithms on LCD monitors," SID Digest, 704-707 (2002).
[30] Carney, T., Klein, S. A., Tyler, C. W., Silverstein, A. D., Beutter, B., Levi, D., Watson, A. B., Reeves, A. J., Norcia, A. M., Chen, C.-C., Makous, W., and Eckstein, M. P., "The development of an image/threshold database for designing and testing human vision models," Proceedings, Human Vision, Visual Processing, and Digital Display IX, SPIE, Bellingham, WA, 3644 (1999).
[31] Feng, X., Speigle, J., and Morimoto, A., "Halftone quality evaluation using color visual models," PICS, 5-10 (2002).
[32] Feng, X. and Daly, S., "Vision-based strategy to reduce the perceived color misregistration of image-capturing devices," Proc. of the IEEE, 90(1), 18-27 (2002).
[33] Williams, D. R., Brainard, D. H., McMahon, M. J., and Navarro, R., "Double-pass and interferometric measures of the optical quality of the eye," J. Opt. Soc. Am. A, 11, 3123-3135 (1994).
[34] Guirao, A., Gonzalez, C., Redondo, M., Geraghty, E., Norrby, S., and Artal, P., "Average optical performance of the human eye as a function of age in a normal population," Inv. Ophth. Vis. Sci., 40, 203-213 (1999).
[35] Deeley, R. J., Drasdo, N., and Charman, W. N., "A simple parametric model of the human ocular modulation transfer function," Ophthalmic Physiol. Opt., 11, 91-93 (1991).
[36] Rovamo, J., Kukkonen, H., and Mustonen, J., "Foveal optical modulation transfer function of the human eye at various pupil sizes," J. Opt. Soc. Am. A, 15, 2504-2513 (1998).
[37] Marimont, D. H. and Wandell, B. A., "Matching color images: the effects of chromatic aberration," J. Opt. Soc. Am. A, 11, 3113-3122 (1994).
[38] Watson, A. B., "The cortex transform: Rapid computation of simulated neural images," Comp. Vis. Gr. and Im. Proc., 39(3), 311-327 (1987).
