Preprint

A SPATIOTEMPORAL NO-REFERENCE VIDEO QUALITY ASSESSMENT MODEL

Baris Konuk(1,2), Emin Zerman(1), Gokce Nur(3) and Gozde Bozdagi Akar(1)

(1) Electrical and Electronics Engineering Department, Middle East Technical University, Ankara, Turkey
(2) Aselsan Inc., Ankara, Turkey
(3) Electrical and Electronics Engineering Department, Kırıkkale University, Kırıkkale, Turkey

ABSTRACT

Many researchers have been developing objective video quality assessment methods due to the increasing demand from end users for perceived video quality measurements, which in turn speeds up advancements in multimedia services. However, most of these methods are either Full-Reference (FR) metrics, which require the original video, or Reduced-Reference (RR) metrics, which need some features extracted from the original video. No-Reference (NR) metrics, on the other hand, do not require any information about the original video; hence, they are much more suitable for applications like video streaming. This paper presents a novel, objective, NR video quality assessment algorithm. The proposed algorithm is based on the spatial extent of the video, the temporal extent of the video as captured by motion vectors, the bit rate, and the packet loss ratio. Test results obtained using the LIVE video quality database demonstrate the accuracy and robustness of the proposed metric.

Index Terms— Video quality assessment (VQA), no-reference metric, spatiotemporal information, packet loss, quality of experience (QoE).

1. INTRODUCTION

The term Quality of Experience (QoE) denotes the end user's satisfaction with provided video and multimedia services. QoE is therefore a rather ill-defined term, since various subjective parameters such as the individual interest, quality expectation, and video experience of the viewer contribute to it [1]. Nonetheless, the video quality perceived by end users is considered the most important component of QoE [2].

Subjective tests, in which typically 15-40 subjects are asked to evaluate the quality of videos, are the best methods for analyzing perceived video quality. However, subjective tests take too much time and are therefore not suitable for online Video Quality Assessment (VQA). Considering this limitation of subjective tests, researchers have started to develop objective VQA metrics that measure video quality in an efficient and reliable way.

However, the well-known objective video quality metrics developed so far either do not accurately reflect the video quality perceived by end users (e.g., PSNR) or are operationally costly (e.g., the Video Quality Metric (VQM) [3]). Therefore, there is a need for an accurate and efficient VQA model that produces video quality scores similar to the Mean Opinion Scores (MOS) obtained in subjective tests, i.e., a model that reflects end user perception.

In the literature, VQA metrics can be classified into three categories according to the availability of the reference: Full-Reference (FR), Reduced-Reference (RR), and No-Reference (NR). FR VQA metrics are assumed to have full access to the reference video in uncompressed and unimpaired form. RR VQA metrics do not require the reference video itself, but they employ some features extracted from it. Nevertheless, RR metrics need these features to be extracted accurately and transferred without any distortion to the receiver side. NR VQA metrics, on the other hand, do not require any information about the reference video. Hence, NR VQA metrics are much more flexible than FR and RR metrics, because it may be difficult, if not impossible, to access the reference video or its features [4].

In this paper, we present a novel objective NR VQA model, in which we estimate perceived video quality based on spatiotemporal features and packet loss ratios along with the video bit rates. We evaluated the designed VQA metric on the LIVE video quality database [5].

The organization of this paper is as follows. In Section 2, the proposed VQA metric is introduced. In Section 3, we compare the performance of the designed metric to other state-of-the-art VQA metrics on the LIVE video quality database. Finally, Section 4 concludes the paper.

2. PROPOSED METHOD

The Human Visual System (HVS) is sensitive to sharp changes in video, both temporal and spatial [4]. Therefore, the first step of our proposed model is to obtain information

about the spatiotemporal complexity of a video sequence. The second step is to consider the bit rate and the packet loss ratio, which reflect the distortions introduced by compression and transmission, respectively. In the following sub-sections, the spatial complexity, temporal complexity, bit rate, and packet loss ratio measurements of the proposed model are discussed in turn, followed by a description of how the model is formed.

2.1. Analysis of Spatial Complexity

International Telecommunication Union (ITU) Recommendation P.910 [6] endorses the spatial complexity measure, or spatial perceptual information (SI), computed by Sobel-filtering each video frame F_n at time n. SI is the maximum over all frames of the spatial standard deviation of the Sobel-filtered frame:

SI = \max_n \{ \mathrm{std}[\mathrm{Sobel}(F_n)] \}    (1)

Since the SI value is defined as the maximum of the standard deviations over all video frames, peaks that occur at scene cuts may prevent the SI value from correctly representing the spatial complexity of a video sequence. Therefore, we employ the Modified Spatial Information (MSI), which is the average of the standard deviations of the Sobel-filtered frames:

MSI = \frac{1}{N} \sum_{n=1}^{N} \mathrm{std}[\mathrm{Sobel}(F_n)]    (2)
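As an illustration, a minimal Python sketch of (1) and (2) might look as follows; it assumes the decoded frames are already available as grayscale 2-D numpy arrays, and the function name si_msi is ours rather than the paper's:

```python
import numpy as np
from scipy import ndimage

def si_msi(frames):
    """Compute SI (eq. 1) and MSI (eq. 2) from grayscale frames.

    frames: iterable of 2-D numpy arrays, one per video frame.
    """
    stds = []
    for f in frames:
        f = f.astype(np.float64)
        # Sobel gradient magnitude of the frame
        gx = ndimage.sobel(f, axis=1)
        gy = ndimage.sobel(f, axis=0)
        grad = np.hypot(gx, gy)
        # spatial standard deviation of the Sobel-filtered frame
        stds.append(grad.std())
    si = max(stds)               # eq. (1): maximum over frames
    msi = float(np.mean(stds))   # eq. (2): average over frames
    return si, msi
```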

2.2. Analysis of Temporal Complexity

We consider the amount of motion in a video to be the primary indicator of its temporal complexity. Hence, Motion Vectors (MVs) are used as the basic elements of the temporal complexity analysis in this study. The zero MV ratio for frame i, Z_i, is calculated as the ratio of the number of MVs with value 0 to the number of all MVs in the frame. The first feature for temporal complexity, the zero MV ratio Z, is then computed by averaging the Z_i values:

Z_i = \frac{\mathrm{count}_i(MV=0)}{\mathrm{count}_i(MV)}    (3)

Z = \frac{1}{N} \sum_{i=1}^{N} Z_i    (4)

where count_i(MV=0) represents the number of zero MVs in frame i, count_i(MV) represents the total number of MVs in frame i, and N is the number of frames of the video being analyzed.

Z is used to estimate the proportion of still regions in the video. When Z is high, the video consists of many static scenes with some local movements. When Z is low, on the other hand, there may be uniform global movement and/or many local movements in many frames of the video [7]. Although Z discriminates between still sequences and frames with a high amount of motion, it does not discriminate between slowly and rapidly changing sequences. Since this distinction is quite significant for understanding the temporal complexity of the video, we added a second feature for temporal complexity, the mean magnitude of non-zero MVs, M. The mean magnitude of non-zero MVs in frame i, M_i, is calculated by averaging the magnitudes of the non-zero MVs, normalized to the screen size:

M_i = \frac{1}{K_i} \sum_{k=1}^{K_i} \frac{|MV_k|}{w \cdot h}    (5)

where K_i represents the number of non-zero MVs in frame i, and w and h are the width and height of the screen in pixels, respectively. M is computed by averaging the M_i values over all N frames, as follows:

M = \frac{1}{N} \sum_{i=1}^{N} M_i    (6)
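A sketch of the temporal feature extraction in (3)-(6), assuming the motion vectors produced by the modified decoders are available as one (K, 2) array of (dx, dy) components per frame (this data layout and the function name are illustrative assumptions):

```python
import numpy as np

def temporal_features(mv_per_frame, width, height):
    """Compute the zero MV ratio Z (eqs. 3-4) and the mean
    non-zero MV magnitude M (eqs. 5-6).

    mv_per_frame: list of (K, 2) numpy arrays of (dx, dy)
    motion vectors, one array per frame.
    """
    z_vals, m_vals = [], []
    for mvs in mv_per_frame:
        mags = np.hypot(mvs[:, 0], mvs[:, 1])
        zero = mags == 0
        z_vals.append(zero.mean())   # eq. (3): zero-MV ratio for the frame
        nonzero = mags[~zero]
        if nonzero.size:
            # eq. (5): mean non-zero MV magnitude, normalized to screen size
            m_vals.append(nonzero.mean() / (width * height))
        else:
            m_vals.append(0.0)       # frame with no motion at all
    Z = float(np.mean(z_vals))       # eq. (4)
    M = float(np.mean(m_vals))       # eq. (6)
    return Z, M
```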

It is worth noting here that we modified the H.264 reference software, Joint Model 12.3 [8], and the MPEG-2 reference software [9] in order to decode the videos and extract the MV features described above. Having obtained the features MSI, Z, and M, which estimate the spatiotemporal complexity of the video, we add the average bit rate parameter, BR, to the proposed VQA metric in order to convert the spatiotemporal complexity into a video quality measure.

2.3. Average Bit Rate Assessment

The average bit rate, BR, is calculated by dividing the video payload size by the video duration:

BR = \frac{\text{video payload size}}{\text{video duration}}    (7)

2.4. Packet Loss Ratio Assessment

The packet loss ratio, \beta, is calculated as the percentage of lost video packets:

\beta = \frac{\text{number of lost video packets}}{\text{total number of sent video packets}} \times 100    (8)
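Both quantities reduce to simple ratios; a minimal sketch, assuming the payload size is measured in bits and the packet counts are reported by the network simulation:

```python
def average_bit_rate(payload_bits, duration_s):
    """Eq. (7): average bit rate in bits per second."""
    return payload_bits / duration_s

def packet_loss_ratio(lost_packets, sent_packets):
    """Eq. (8): packet loss ratio as a percentage."""
    return 100.0 * lost_packets / sent_packets
```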

Table 1. Coefficients for compression distortion

Coefficient Name    Value
a                   45.67
b                   8236
c                   -589.7
d                   396500
e                   -50430
f                   4182

2.5. Proposed VQA Metric

While designing the proposed VQA metric from the measured features, we first combine the features MSI and BR to form the parameter S:

S = \frac{MSI}{BR}    (9)

where S represents the spatial distortion. S is defined as the spatial distortion because, for a constant average bit rate, the spatial distortion increases as the spatial complexity MSI increases; similarly, for videos of the same spatial complexity, the spatial distortion decreases as the average bit rate BR increases. Secondly, we combine the features Z, M, and BR to form the parameter T, which represents the temporal distortion:

T = \frac{(1-Z) \cdot M}{BR}    (10)

The numerator in (10) reflects the amount of motion: the term (1-Z) is the proportion of a frame where motion exists, and M is the mean magnitude of the MVs in that region. Hence, the multiplicative term represents the amount of motion. For a constant average bit rate, the temporal distortion increases as the amount of motion in the video increases; similarly, for videos of the same temporal complexity, the temporal distortion decreases as the average bit rate increases.

We use the LIVE video quality database, which consists of 10 high-quality reference videos with various contents and 150 distorted videos (15 distorted videos per reference video). Four different distortion processes exist in the database: MPEG-2 compression, H.264 compression, simulated transmission of H.264 compressed bitstreams through error-prone IP networks, and simulated transmission through error-prone wireless networks [5]. As a starting point, we focus on the distortions introduced by compression only. Therefore, we use 10 H.264 compressed videos and employ the Curve Fitting Toolbox of Matlab to obtain a second degree polynomial relating S and T to the DMOS estimates for the H.264 compressed bitstreams, DMOS_H264, in the LIVE video quality database:

DMOS_{H264} = a + b S + c T + d S^2 + e S T + f T^2    (11)

The coefficient values of (11) are presented in Table 1. When (11) is applied to MPEG-2 compressed videos, it produces an offset in the DMOS estimates. In order to overcome this offset, we perform another training step using 10 MPEG-2 compressed videos. The first order equation in (12) is applied to videos with Z values below 0.01 (for videos with Z >= 0.01, the DMOS_H264 estimate is used directly), yielding the DMOS estimates for H.264 and MPEG-2 compressed videos, DMOS_comp:

DMOS_{comp} = 0.97 \, DMOS_{H264} - 5.18, \quad \text{if } Z < 0.01    (12)
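Putting (9)-(12) together with the Table 1 coefficients gives the compression-distortion part of the metric. The sketch below is our reading of these equations; in particular, the fall-through for Z >= 0.01, where the H.264 estimate is used unchanged, is inferred from the condition in (12):

```python
# Table 1 coefficients for the compression-distortion model, eq. (11)
A, B, C, D, E, F = 45.67, 8236, -589.7, 396500, -50430, 4182

def dmos_compression(msi, Z, M, br):
    """Eqs. (9)-(12): DMOS estimate for compressed (H.264/MPEG-2) video."""
    S = msi / br                  # eq. (9): spatial distortion
    T = (1.0 - Z) * M / br        # eq. (10): temporal distortion
    dmos_h264 = (A + B * S + C * T
                 + D * S**2 + E * S * T + F * T**2)   # eq. (11)
    # eq. (12): offset correction for nearly static videos; for
    # Z >= 0.01 the H.264 estimate is used directly (our reading).
    if Z < 0.01:
        return 0.97 * dmos_h264 - 5.18
    return dmos_h264
```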

Having obtained the DMOS estimates for the compressed videos, the next step is to take the distortions occurring in transmission into account. The packet loss ratio parameter, \beta, is found for the IP and wireless network distorted bitstreams. Noting that each packet transferred via a wireless network contains much more information about the video than each packet transferred via an IP network, IP and wireless network distorted videos are treated separately. We use 10 IP and 10 wireless network distorted videos for training in order to account for the transmission distortions. The functional form of the modification h_TR applied to DMOS_comp is the same for both types of distorted videos; however, the coefficients of h_TR in (13) differ for IP and wireless network distorted bitstreams, as can be seen in Table 2. It should be noted that h_TR is not applied to H.264 and MPEG-2 compressed videos, since their packet loss ratio is zero.

h_{TR} = m \cdot e^{n \beta}, \quad \text{if } \beta > 0    (13)

As a result, we obtain the final functional form of the proposed VQA metric, which takes into account the distortions occurring in both the compression and transmission phases. It is worth remembering that the appropriate coefficients for the employed network type (IP or wireless) must be used when evaluating h_TR. The final DMOS estimate, DMOS, is calculated as follows:

DMOS = DMOS_{comp} \cdot h_{TR}    (14)
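A sketch of the transmission stage in (13)-(14) with the Table 2 coefficients; the exponential form of h_TR follows our reconstruction of (13), and the dictionary keys are illustrative:

```python
import math

# Table 2 coefficients for the transmission-distortion model, eq. (13)
TR_COEFFS = {"ip": (1.379, 0.05118), "wireless": (1.077, 0.08504)}

def dmos_final(dmos_comp, beta, network=None):
    """Eqs. (13)-(14): apply the transmission modification h_TR.

    beta: packet loss ratio in percent; network: "ip" or "wireless".
    """
    if beta <= 0 or network is None:
        return dmos_comp           # compression-only video: h_TR not applied
    m, n = TR_COEFFS[network]
    h_tr = m * math.exp(n * beta)  # eq. (13), assuming an exponential form
    return dmos_comp * h_tr        # eq. (14)
```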

Table 2. Coefficients for transmission distortion

Distortion Type     Coefficient Name    Value
IP Network          m                   1.379
IP Network          n                   0.05118
Wireless Network    m                   1.077
Wireless Network    n                   0.08504

Table 3. Results on the LIVE video quality database

Method          PCC (H.264)   PCC (All data)   SROCC (H.264)   SROCC (All data)
PSNR            0.4385        0.4035           0.4296          0.3684
VSNR [10]       0.6216        0.6896           0.6460          0.6755
SW-SSIM [11]    0.7206        0.5962           0.7086          0.5849
MS-SSIM [12]    0.6919        0.7441           0.7051          0.7361
VQM [3]         0.6459        0.7236           0.6520          0.7026
MOVIE [13]      0.7902        0.8116           0.7664          0.7890
C-VQA [14]      0.7927        -                0.7720          -
Proposed        0.8122        0.6730           0.8026          0.6697

3. EXPERIMENTAL RESULTS

We utilized the LIVE video quality database for both training and testing, with a cross-validation training procedure. In training, we used 10 videos for each distortion process (i.e., H.264 compression, MPEG-2 compression, simulated transmission of H.264 compressed bitstreams through error-prone IP networks, and simulated transmission through error-prone wireless networks). Hence, we used 40 of the 150 distorted videos in the LIVE video quality database for the training procedure. The remaining videos in the database were employed to evaluate the designed VQA metric.

We compare the performance of our VQA algorithm with well-known FR metrics (PSNR, VSNR, SW-SSIM, MS-SSIM, VQM, and MOVIE) and an NR VQA method operating in the compressed domain (C-VQA). To compare the performance of these methods, the Pearson Correlation Coefficient (PCC) and the Spearman Rank Order Correlation Coefficient (SROCC) are used. Fig. 1 shows the scatter plot of the subjective DMOS against the DMOS estimates of the proposed metric on the LIVE video quality database. A comparison of the results of the designed model on the LIVE video quality database with those of state-of-the-art VQA models on the same database is provided in Table 3. It is worth noting that C-VQA was developed and tested only for H.264 compressed bitstreams.

As can be observed from the results, the proposed metric outperforms the other metrics for the H.264 compressed videos, even though all of these metrics except C-VQA are FR metrics and have full access to the reference video in uncompressed and unimpaired form. The performance of the proposed model on all distortion types is competitive with the other metrics except MOVIE. This is also promising, keeping in mind that the proposed VQA model is an NR model, i.e., it does not require any information about the reference video.

Fig. 1. Scatter plot of subjective DMOS against predicted DMOS by the proposed metric
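The two evaluation criteria are standard and available in scipy; a minimal sketch (the dummy arrays in the usage example are illustrative, not values from the database):

```python
import numpy as np
from scipy import stats

def evaluate(dmos_subjective, dmos_predicted):
    """Pearson (PCC) and Spearman (SROCC) correlations between
    subjective DMOS and the DMOS predicted by a metric."""
    pcc, _ = stats.pearsonr(dmos_subjective, dmos_predicted)
    srocc, _ = stats.spearmanr(dmos_subjective, dmos_predicted)
    return pcc, srocc

# hypothetical usage with dummy arrays
if __name__ == "__main__":
    subj = np.array([35.2, 48.1, 62.7, 55.0, 41.3])
    pred = np.array([37.0, 45.9, 60.2, 57.8, 40.1])
    print(evaluate(subj, pred))
```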

4. CONCLUSION

In this paper, a novel, spatiotemporally structured NR VQA metric has been proposed. The proposed metric has been evaluated on the LIVE video quality database. The evaluation results show that the proposed NR VQA model performs competitively with state-of-the-art FR VQA metrics. Using the proposed NR VQA metric, further development of multimedia services and technologies can be supported in a timely fashion.

5. REFERENCES

[1] S. Winkler and P. Mohandas, "The evolution of video quality measurement: from PSNR to hybrid metrics," IEEE Transactions on Broadcasting, vol. 54, no. 3, pp. 660-668, 2008.

[2] S. Winkler, Digital Video Quality: Vision Models and Metrics, John Wiley & Sons, 2005.

[3] M.H. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312-322, Sep. 2004.

[4] S.A. Amirshahi and M. Larabi, "Spatial-temporal video quality metric based on an estimation of QoE," in Proc. International Workshop on Quality of Multimedia Experience (QoMEX), pp. 84-89, 2011.

[5] K. Seshadrinathan, R. Soundararajan, A.C. Bovik, and L.K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427-1441, 2010.


[6] International Telecommunication Union (ITU), "Subjective video quality assessment methods for multimedia applications," ITU-T Recommendation P.910, 1999.


[7] M. Ries and B. Gardlo, "Audiovisual quality estimation for mobile video services," IEEE Journal on Selected Areas in Communications, vol. 28, no. 3, pp. 501-509, 2010.


[8] Joint Video Team, “H.264/AVC software coordination,” http://iphome.hhi.de/suehring/tml, 2007.

[9] International Organization for Standardization, "MPEG-2 standard," http://standards.iso.org/ittf/PubliclyAvailableStandards, 2005.

[10] D.M. Chandler and S.S. Hemami, "VSNR: A wavelet-based visual signal-to-noise ratio for natural images," IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2284-2298, 2007.

[11] Z. Wang and Q. Li, "Video quality assessment using a statistical model of human visual speed perception," Journal of the Optical Society of America A, vol. 24, no. 12, pp. B61-B69, 2007.

[12] Z. Wang, E.P. Simoncelli, and A.C. Bovik, "Multiscale structural similarity for image quality assessment," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1398-1402, 2003.

[13] K. Seshadrinathan and A.C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335-350, 2010.

[14] X. Lin, H. Ma, L. Luo, and Y. Chen, "No-reference video quality assessment in the compressed domain," IEEE Transactions on Consumer Electronics, vol. 58, no. 2, pp. 505-512, 2012.