
The 2010 International Conference on Advanced Technologies for Communications

OPTIMAL SPATIAL-TEMPORAL WEIGHT PREDICTION FOR INTER-FRAME CODING OF H.264/AVC VIDEO SEQUENCES

Dũng T. Võ (1,2), Chan-Won Seo (3), Daqian Jin (4), Jong-Ki Han (3) and Truong Q. Nguyen (1)

(1) Department of Electrical and Computer Engineering, University of California at San Diego, USA.
(2) Faculty of Electrical and Electronics, Ho Chi Minh University of Technology, Viet Nam.
(3) Department of Information and Communications Engineering, Sejong University, Korea.
(4) IP Video Solutions, Motorola, USA.

ABSTRACT

The paper proposes a novel method for optimal weight prediction in inter-frame coding of H.264/AVC compressed video sequences. The weights are optimized so that the motion-compensated frame is as close as possible to the current frame. This minimizes the residual frame and thus reduces the number of bits required to code it. We first consider the general optimal spatial-temporal weight prediction and then give the special solution for temporal weight prediction using two reference frames (bi-predictive coding). Simulations on a wide range of video sequences verify the effectiveness of the algorithm in terms of both visual quality and rate-distortion performance.

Fig. 1. Block diagram of the H.264/AVC encoder (encoding part and reconstructing part).

Index Terms— H.264/AVC compression, inter-frame coding, weight prediction.


This work is funded by Texas Instruments Inc. The first author is now affiliated with Digital Media Solutions Lab, Samsung Information Systems America (Samsung US R&D Center), CA, USA.

978-1-4244-8873-5/10/$26.00 ©2010 IEEE

1. INTRODUCTION

H.264/AVC is the latest international video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. It achieves approximately 50% bit saving at equivalent perceptual quality relative to prior standards such as H.263+ and MPEG-4 ASP [1].

The block diagram of a conventional H.264/AVC encoder with weight prediction is shown in Fig. 1. The encoder includes two parts: the encoding part and the reconstructing part. First, previously coded frames $\tilde{x}(t+t_i)$ are motion estimated and motion compensated against the original current frame $x(t)$. These motion-compensated frames are weighted by the coefficients $H$ to form the predicted frame $x_p(t)$, which is subtracted from $x(t)$ to obtain the residual signal $\Delta x(t)$. An integer transform and quantization are then applied to this residual signal to form the coefficients that are coded in the bitstream. In the reconstructing part, the coded residual signal is recovered by inverse quantization and inverse integer transform of these coefficients. It is then added to the motion-compensated frame and filtered by an in-loop deblocking filter $H_{DF}$ to obtain the reconstructed signal $\tilde{x}(t)$, which is buffered as a reference frame for coding later frames.

The performance gain of H.264/AVC over other standards comes from improved content prediction (a smaller $\Delta x(t)$) and improved coding efficiency (fewer bits to compress $\Delta x(t)$). Its key techniques in content prediction include motion compensation with variable block sizes [2], quarter-pixel motion estimation and multiple reference frames [3], weight prediction [4], and an in-loop deblocking filter [5]. This paper focuses on novel weight prediction methods that further reduce the residual signal $\Delta x(t)$. As observed from Fig. 1, the closer the predicted frame is to the original frame, the smaller the residual signal and the fewer bits are needed to encode it.

If only one reference frame is used for prediction, the current frame is coded as a Predicted (P) frame. If two reference frames are used, it is coded as a Bi-predictive (B) frame. Both P and B coding schemes are extremely useful in reducing temporal redundancy. Fig. 2 shows the coding scheme for B frames with two reference frames: one previously coded frame and one future coded frame. In general [6], the reference frames can be future or previous frames, and there can be more than two of them.

Fig. 2. Bi-predictive coding scheme.

In the H.264/AVC reference software [7], there are three modes of weight prediction: no weight prediction, ‘implicit’ prediction and ‘explicit’ prediction. For a block in the current frame $x(t)$, if no weight prediction is applied, the best matching blocks in the reference frames $\tilde{x}(t+t_0)$ and $\tilde{x}(t+t_1)$ are averaged and used as the motion estimated block. In ‘implicit’ weighted prediction, the weighting coefficients $h(t_0)$ and $h(t_1)$ are determined by the relative temporal positions $t_0$ and $t_1$ of the reference frames. For the coding scheme in Fig. 2,

$$h(t_0) = \frac{t_1}{t_1 - t_0} \tag{1}$$

and

$$h(t_1) = \frac{-t_0}{t_1 - t_0}. \tag{2}$$

In ‘explicit’ weighted prediction, the weighting coefficients $h(t_0)$ and $h(t_1)$ are set in the header information of each slice. Weight prediction is useful in coding frames with scene changes or fading. In [4], the weight factors are estimated from the ratio of DC values between the reference frames and the current frame. In this paper, we propose a novel weight prediction method that achieves the smallest $\Delta x(t)$ with no DC shift.

The paper is organized as follows. A general spatial-temporal adaptive weight prediction is considered and optimized in Section 2. Section 3 discusses the special case of temporal weight prediction, which has a simple solution. Simulations are presented in Section 4. Finally, Section 5 gives concluding remarks.
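The implicit weights in (1) and (2) depend only on the temporal offsets of the two reference frames. The short Python sketch below evaluates them; the function name `implicit_weights` is a hypothetical helper introduced here for illustration and is not taken from the reference software.

```python
from typing import Tuple


def implicit_weights(t0: float, t1: float) -> Tuple[float, float]:
    """Evaluate (1) and (2) for reference frames at temporal offsets t0 and t1.

    Assumption for this sketch: offsets are measured relative to the current
    frame, e.g. t0 < 0 for a previous frame and t1 > 0 for a future frame.
    """
    if t1 == t0:
        raise ValueError("the two reference frames must have different offsets")
    h0 = t1 / (t1 - t0)   # equation (1)
    h1 = -t0 / (t1 - t0)  # equation (2)
    return h0, h1


# Symmetric references reduce to plain averaging.
print(implicit_weights(-1, 1))   # (0.5, 0.5)
# The temporally closer reference (here t0 = -1) receives the larger weight.
print(implicit_weights(-1, 3))   # (0.75, 0.25)
```

By construction the two weights always sum to one, which is the same normalization imposed on the optimized weights later in the paper.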

2. OPTIMAL MOTION COMPENSATED SPATIO-TEMPORAL ADAPTIVE WEIGHT PREDICTION

This section discusses an algorithm to optimize the coefficients $H$ so that the predicted frame $x_p(t)$ is the best approximation of the original frame $x(t)$. Assume that, to predict one pixel of interest $x(t,m,n)$ in the current frame, a spatio-temporal cube of $C = (2C_T+1)\times(2C_M+1)\times(2C_N+1)$ pixels is used. The pixel matched to $x(t,m,n)$ in frame $(t+t_i)$ at location $(m_i,n_i)$ is denoted $x_c(t,m,n,t_i,m_i,n_i)$. A linear filter $H$ is applied to these cubes to obtain the predicted frame

$$x_p(t,m,n) = \sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i)\, x_c(t,m,n,t_i,m_i,n_i). \tag{3}$$

A condition on the weights is set as

$$\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i) = 1 \tag{4}$$

to ensure that the output has the same average value as the input $x(t,m,n)$. The mean square error (MSE) between the original frame and the predicted frame is

$$\begin{aligned}
\mathrm{MSE} &= \frac{1}{MN}\sum_{m,n}\big[x(t,m,n)-x_p(t,m,n)\big]^2 \\
&= \frac{1}{MN}\sum_{m,n}\Big[\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i)\big[x(t,m,n)-x_c(t,m,n,t_i,m_i,n_i)\big]\Big]^2 \\
&= \frac{1}{MN}\sum_{m,n}\Big[\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i)\,\Delta x(t,m,n,t_i,m_i,n_i)\Big]^2 \\
&= \frac{1}{2}\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h^2(t_i,m_i,n_i)\,A_{(t_i,m_i,n_i),(t_i,m_i,n_i)} \\
&\quad + \frac{1}{2}\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}}\;\sum_{\substack{t_j,m_j,n_j,\; t_j \neq 0 \\ (t_j,m_j,n_j) \neq (t_i,m_i,n_i)}} h(t_i,m_i,n_i)\,h(t_j,m_j,n_j)\,A_{(t_i,m_i,n_i),(t_j,m_j,n_j)}
\end{aligned} \tag{5}$$

where the second equality uses condition (4),

$$\Delta x(t,m,n,t_i,m_i,n_i) = x(t,m,n) - x_c(t,m,n,t_i,m_i,n_i) \tag{6}$$

and

$$A_{(t_i,m_i,n_i),(t_j,m_j,n_j)} = \frac{2}{MN}\sum_{m,n} \Delta x(t,m,n,t_i,m_i,n_i)\,\Delta x(t,m,n,t_j,m_j,n_j).$$

The optimal values of $h(t_i,m_i,n_i)$, which give the minimal MSE under condition (4), are found using the Lagrangian function

$$L\big(\{h(t_i,m_i,n_i)\},\lambda\big) = \mathrm{MSE} + \lambda\Big(\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i) - 1\Big). \tag{7}$$

To find the optimal solution, the partial derivative with respect to each variable is set to zero:

$$\begin{cases}
\dfrac{\partial L}{\partial h(t_i,m_i,n_i)} = \displaystyle\sum_{\substack{t_j,m_j,n_j \\ t_j \neq 0}} h(t_j,m_j,n_j)\,A_{(t_i,m_i,n_i),(t_j,m_j,n_j)} + \lambda = 0, \\[2ex]
\dfrac{\partial L}{\partial \lambda} = \displaystyle\sum_{\substack{t_i,m_i,n_i \\ t_i \neq 0}} h(t_i,m_i,n_i) - 1 = 0.
\end{cases} \tag{8}$$

The optimal $h(t_i,m_i,n_i)$ is the root of the system of equations in (8), which in matrix form is

$$\begin{bmatrix} \big[A_{(t_i,m_i,n_i),(t_j,m_j,n_j)}\big] & \mathbf{1}_{C'\times 1} \\ \mathbf{1}_{1\times C'} & 0 \end{bmatrix}
\begin{bmatrix} \big[h(t_i,m_i,n_i)\big] \\ \lambda \end{bmatrix}
= \begin{bmatrix} \mathbf{0}_{C'\times 1} \\ 1 \end{bmatrix} \tag{9}$$

where

$$t_i \neq 0, \quad t_j \neq 0 \tag{10}$$

and

$$C' = C - W_M \times W_N. \tag{11}$$

The solution $h(t_i,m_i,n_i)$ of (9) gives the optimal weights for multi-frame prediction.

3. OPTIMAL WEIGHTS FOR INTER-FRAME CODING

If only temporal weight prediction with two reference frames is used, (3) becomes B-frame prediction. The filter is then no longer a spatial-temporal filter but a purely temporal one. The notations $h(t_0,m_i,n_i)$ and $h(t_1,m_i,n_i)$ are simplified to $h(t_0)$ and $h(t_1)$, respectively. Similarly, $x_c(t,m,n,t_0,m_i,n_i)$ and $x_c(t,m,n,t_1,m_i,n_i)$ are simplified to $x_c(t,m,n,t_0)$ and $x_c(t,m,n,t_1)$, respectively. Then (3) becomes

$$x_p(t,m,n) = h(t_0)\,x_c(t,m,n,t_0) + h(t_1)\,x_c(t,m,n,t_1). \tag{12}$$
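As a numerical cross-check on Section 2, the bordered system (9) can be assembled and solved directly for any number of candidate predictors. The Python sketch below is illustrative only: the function names (`gram_matrix`, `solve_linear`, `optimal_weights`) and the flattening of each frame into a plain list of pixel values are assumptions made here, not part of the paper or the reference software. With two predictors it returns the pair $h(t_0)$, $h(t_1)$ used in (12).

```python
from typing import List, Sequence


def gram_matrix(x: Sequence[float], preds: Sequence[Sequence[float]]) -> List[List[float]]:
    """A[i][j] = (2/P) * sum_p (x[p] - preds[i][p]) * (x[p] - preds[j][p])."""
    count = len(x)
    resid = [[x[p] - c[p] for p in range(count)] for c in preds]
    return [[2.0 * sum(a * b for a, b in zip(ri, rj)) / count for rj in resid]
            for ri in resid]


def solve_linear(mat: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve mat * sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (predictors may be identical)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - known) / aug[r][r]
    return sol


def optimal_weights(x: Sequence[float], preds: Sequence[Sequence[float]]) -> List[float]:
    """Weights minimizing the MSE subject to sum(weights) == 1, via system (9)."""
    a = gram_matrix(x, preds)
    k = len(preds)
    # Border the matrix A with ones for the Lagrange multiplier, as in (9).
    kkt = [a[i] + [1.0] for i in range(k)] + [[1.0] * k + [0.0]]
    rhs = [0.0] * k + [1.0]
    return solve_linear(kkt, rhs)[:k]  # drop the multiplier lambda


# Synthetic check: x is exactly 0.3 * c0 + 0.7 * c1, so the weights are recovered.
c0 = [10.0, 20.0, 30.0, 40.0]
c1 = [20.0, 10.0, 40.0, 30.0]
x = [17.0, 13.0, 37.0, 33.0]
print(optimal_weights(x, [c0, c1]))  # close to [0.3, 0.7]
```

A hand-written elimination routine is used only to keep the sketch dependency-free; any linear solver would serve equally well.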

If $h(t_0)$ and $h(t_1)$ were allowed to take different values for different pixels of $x_c(t,m,n,t_0)$ and $x_c(t,m,n,t_1)$, the predicted frame could be brought closer to the original current frame than with a single pair of values for the whole frame. However, coding such adaptive weighting coefficients requires more bits than coding a single pair. In this section, because the number of bits used for B-frame coding is small, a single value of $h(t_0)$ and a single value of $h(t_1)$ are applied to all pixels of $x_c(t,m,n,t_0)$ and $x_c(t,m,n,t_1)$, respectively. Optimal weight prediction in this case is obtained by simplifying (9) to

$$\begin{bmatrix} A_{t_0,t_0} & A_{t_0,t_1} & 1 \\ A_{t_1,t_0} & A_{t_1,t_1} & 1 \\ 1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} h(t_0) \\ h(t_1) \\ \lambda \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{13}$$

where

$$A_{t_i,t_j} = \frac{2}{MN}\sum_{m,n}\big[x(t,m,n)-x_c(t,m,n,t_i)\big]\big[x(t,m,n)-x_c(t,m,n,t_j)\big] \tag{14}$$

and $x_c(t,m,n,t_k)$ is the pixel at position $(m,n)$ that is motion compensated from the reference frame $\tilde{x}(t+t_k)$. It is easy to see that $A$ is symmetric:

$$A_{t_0,t_1} = A_{t_1,t_0}. \tag{15}$$

The solution of (13) gives the optimal weighting coefficients for B-frame coding:

$$\begin{cases}
h(t_0) = \dfrac{A_{t_1,t_1} - A_{t_0,t_1}}{A_{t_0,t_0} - 2A_{t_0,t_1} + A_{t_1,t_1}}, \\[2ex]
h(t_1) = 1 - h(t_0).
\end{cases} \tag{16}$$

In (12), if all pixels in frame $\tilde{x}(t+t_1)$ are set to one, the output $x_p(t)$ becomes the P-frame prediction with scaling factor $h(t_0)$ and offset $h(t_1)$:

$$x_p(t,m,n) = h(t_0)\,x_c(t,m,n,t_0) + h(t_1). \tag{17}$$

The optimal scaling factor and offset are found as in (16), where $A_{t_i,t_j}$ is calculated by (14) with

$$x_c(t,m,n,t_1) = 1. \tag{18}$$
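The closed form (16) needs only the three cross terms of (14). The Python sketch below evaluates them on small frames stored as nested lists; the names `cross_term` and `optimal_b_weights`, and the fallback to plain averaging when the two compensated frames are identical, are assumptions introduced here for illustration.

```python
from typing import Sequence, Tuple

Frame = Sequence[Sequence[float]]  # rows of pixel values


def cross_term(x: Frame, ci: Frame, cj: Frame) -> float:
    """A_{ti,tj} of (14): (2/MN) * sum over pixels of (x - ci) * (x - cj)."""
    rows, cols = len(x), len(x[0])
    total = sum((x[m][n] - ci[m][n]) * (x[m][n] - cj[m][n])
                for m in range(rows) for n in range(cols))
    return 2.0 * total / (rows * cols)


def optimal_b_weights(x: Frame, c0: Frame, c1: Frame) -> Tuple[float, float]:
    """Closed-form weights of (16) for two motion-compensated frames c0 and c1."""
    a00 = cross_term(x, c0, c0)
    a01 = cross_term(x, c0, c1)
    a11 = cross_term(x, c1, c1)
    denom = a00 - 2.0 * a01 + a11  # equals (2/MN) * sum of (c1 - c0)^2
    if denom == 0.0:
        # Assumed fallback: identical predictors, so any weights summing
        # to one give the same prediction; use plain averaging.
        return 0.5, 0.5
    h0 = (a11 - a01) / denom
    return h0, 1.0 - h0


# B-frame check: x is exactly 0.3 * c0 + 0.7 * c1.
c0 = [[10.0, 20.0], [30.0, 40.0]]
c1 = [[20.0, 10.0], [40.0, 30.0]]
x = [[17.0, 13.0], [37.0, 33.0]]
print(optimal_b_weights(x, c0, c1))  # close to (0.3, 0.7)
```

Passing a frame of ones as `c1` gives the scaling factor and offset of the P-frame case in (17) and (18).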

This P-frame solution is equivalent to the optimal weight factors in [8], [9], where only optimal weight prediction for P-frame coding is implemented.

4. SIMULATION RESULTS

In this simulation, B-frame coding using the optimal weights in (16) is compared with three bi-predictive coding options of H.264/AVC: no weight prediction ($h(t_0) = h(t_1) = \frac{1}{2}$), the implicit option (where $h(t_0)$ and $h(t_1)$ are calculated from temporal distance as in (1) and (2)) and the explicit option (where the weight factors are estimated from the DC values of the original frames [4]). 197 frames of the Crew flashing sequence, the Foreman sequence and the Foreman sequence with a fade-out effect are compressed with the GOP structure IBBBP. For B frames, only the bi-predictive mode is enabled so that the comparison is not affected by intra and predictive coding modes.

Fig. 3 shows the R-D curves for the simulated sequences. The detailed PSNR and bitrate values for the whole sequences are given in Table 1. As can be seen from Fig. 3(a) for the Crew sequence, the worst R-D curve is obtained with no weight prediction. The implicit and explicit options give better R-D curves, and the proposed optimal weight prediction gives the best. In Table 1, because the same quantization parameters for I, P and B pictures are used in all simulations, the PSNR values for the different bi-predictive options are nearly the same. Optimal weight prediction encodes the sequence at the lowest bitrate of all the bi-predictive options, saving 11–12% bitrate compared with no weight prediction.

Fig. 4 shows the encoded frames using bi-predictive coding with the explicit and optimal weight prediction options. Both are compressed with the same quantization parameters ($Q_I = 32$, $Q_P = 33$ and $Q_B = 34$). Compared against the original frame in Fig. 4(a), the frame encoded with optimal weight prediction in Fig. 4(c) has better quality than the frame encoded with explicit weight prediction in Fig. 4(b), with more detail in the astronauts’ clothes and sharper edges at building corners. Relative to explicit weight prediction, optimal weight prediction improves the PSNR of this frame by 0.41 dB and saves 3% of the bits.

For the Foreman sequence with the fade-out effect, the R-D curves of the explicit and proposed optimal weight prediction options outperform those of the no weight prediction and implicit options, as shown in Fig. 3(b). The R-D curve of the proposed optimal weight prediction is slightly better than that of the explicit option. The results in Table 1 show that it saves 15–35% bitrate compared with no weight prediction.

For a normal sequence such as the Foreman sequence, Fig. 3(c) shows that the proposed optimal option gives nearly the same R-D performance as implicit weight prediction, while no weight prediction and the explicit option give worse R-D curves. This means that optimal weight prediction can be used for all sequences without degrading R-D performance. The results in Table 1 show that it saves up to 3% bitrate compared with no weight prediction. The proposed method has slightly higher complexity than the other weight prediction methods because of the calculation of $A_{t_i,t_j}$ in (14) and of the weighting factors in (16).

5. CONCLUSIONS

The paper proposes a novel weight prediction method to improve the coding efficiency of H.264/AVC compression. A general optimal pixel-based spatial-temporal weight prediction is discussed for an arbitrary number of reference frames, and a simplified solution for B-frame prediction is then given. This optimal weight prediction achieves a better R-D relation than other weight prediction methods for sequences with scene changes and fading effects.

6. REFERENCES

[1] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 560–576, July 2003.
[2] T. Wedi, “Motion Compensation in H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 577–586, July 2003.
[3] T. Wiegand and B. Girod, “Multi-Frame Motion-Compensated Prediction for Video Transmission,” Norwell, MA: Kluwer, 2001.
[4] J.M. Boyce, “Weighted Prediction in the H.264/MPEG AVC Video Coding Standard,” IEEE International Symposium on Circuits and Systems, vol. 3, pp. 789–792, May 2004.
[5] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive Deblocking Filter,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 614–619, July 2003.
[6] I.E.G. Richardson, “H.264 and MPEG-4 Video Compression,” Wiley, ISBN 0-470-84837-5, 2003.
[7] K. Sühring, “H.264/AVC Software Reference,” http://iphome.hhi.de/suehring/tml/.
[8] H. Kato and Y. Nakajima, “Weighting Factor Determination Algorithm for H.264/MPEG-4 AVC Weighted Prediction,” IEEE Workshop on Multimedia Signal Processing, pp. 27–30, September 2004.
[9] Y. Shen, D. Zhang, C. Huang, and J. Li, “Adaptive Weighted Prediction in Video Coding,” IEEE International Conference on Multimedia and Expo, vol. 1, pp. 427–430, June 2004.

Fig. 3. Comparison of R-D curves (PSNR in dB versus bitrate in kbps) for bi-predictive coding with different weight prediction options (Proposed, Explicit, Implicit, non WP): (a) Crew sequence, (b) Foreman fade out sequence, (c) Foreman sequence.

Fig. 4. Bi-predictive coding for the 19th frame of the Crew sequence: (a) original frame, (b) Explicit (32.84 dB, 499.28 kbps), (c) Proposed (33.27 dB, 484.39 kbps).

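Table 1. PSNR and bitrate values for bi-predictive coding options (NWP: no weight prediction).

Crew Sequence

| QP (I/P/B) | Bitrate (kbps) NWP | Bitrate (kbps) Implicit | Bitrate (kbps) Explicit | Bitrate (kbps) Proposed | PSNR (dB) NWP | PSNR (dB) Implicit | PSNR (dB) Explicit | PSNR (dB) Proposed |
|---|---|---|---|---|---|---|---|---|
| 22/23/24 | 1981 | 1905 | 1845 | 1751 | 39.81 | 39.82 | 39.87 | 39.88 |
| 27/28/29 | 1039 | 1000 | 953 | 908 | 36.60 | 36.63 | 36.66 | 36.69 |
| 32/33/34 | 545 | 530 | 499 | 484 | 33.64 | 33.69 | 33.71 | 33.77 |
| 37/38/39 | 298 | 296 | 268 | 264 | 30.98 | 31.02 | 31.12 | 31.11 |

Foreman Sequence with Fading out Effect

| QP (I/P/B) | Bitrate (kbps) NWP | Bitrate (kbps) Implicit | Bitrate (kbps) Explicit | Bitrate (kbps) Proposed | PSNR (dB) NWP | PSNR (dB) Implicit | PSNR (dB) Explicit | PSNR (dB) Proposed |
|---|---|---|---|---|---|---|---|---|
| 22/23/24 | 866 | 777 | 591 | 563 | 41.24 | 41.44 | 41.96 | 41.85 |
| 27/28/29 | 460 | 418 | 318 | 306 | 38.67 | 38.91 | 39.46 | 39.33 |
| 32/33/34 | 250 | 233 | 185 | 181 | 36.35 | 36.55 | 37.07 | 37.05 |
| 37/38/39 | 135 | 132 | 117 | 116 | 34.37 | 34.56 | 34.75 | 34.78 |

Foreman Sequence

| QP (I/P/B) | Bitrate (kbps) NWP | Bitrate (kbps) Implicit | Bitrate (kbps) Explicit | Bitrate (kbps) Proposed | PSNR (dB) NWP | PSNR (dB) Implicit | PSNR (dB) Explicit | PSNR (dB) Proposed |
|---|---|---|---|---|---|---|---|---|
| 22/23/24 | 1085 | 1023 | 1131 | 1031 | 39.95 | 39.96 | 39.90 | 39.93 |
| 27/28/29 | 567 | 533 | 602 | 540 | 37.10 | 37.13 | 37.03 | 37.07 |
| 32/33/34 | 324 | 308 | 342 | 314 | 34.45 | 34.48 | 34.38 | 34.44 |
| 37/38/39 | 194 | 189 | 204 | 195 | 31.88 | 31.92 | 31.81 | 31.88 |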
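The bitrate figures of Table 1 can be turned into relative savings with a few lines of Python. The sketch below uses the NWP and Proposed bitrate columns as read from Table 1 for two of the sequences; the dictionary layout and the helper name `saving_percent` are assumptions made for illustration. These are raw bitrate ratios at fixed QP, so they need not coincide exactly with the percentages quoted in Section 4, since PSNR also differs slightly between options.

```python
from typing import Dict, List

# Bitrates in kbps at QP (I/P/B) = 22/23/24, 27/28/29, 32/33/34, 37/38/39,
# as read from Table 1 (NWP = no weight prediction).
BITRATE_KBPS: Dict[str, Dict[str, List[int]]] = {
    "Crew": {
        "NWP": [1981, 1039, 545, 298],
        "Proposed": [1751, 908, 484, 264],
    },
    "Foreman fade out": {
        "NWP": [866, 460, 250, 135],
        "Proposed": [563, 306, 181, 116],
    },
}


def saving_percent(reference: List[int], test: List[int]) -> List[float]:
    """Bitrate saving of `test` relative to `reference`, in percent, per QP."""
    return [round(100.0 * (r - t) / r, 1) for r, t in zip(reference, test)]


for name, columns in BITRATE_KBPS.items():
    print(name, saving_percent(columns["NWP"], columns["Proposed"]))
# Crew [11.6, 12.6, 11.2, 11.4]
# Foreman fade out [35.0, 33.5, 27.6, 14.1]
```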