IEEE SYSTEMS JOURNAL, VOL. 8, NO. 1, MARCH 2014

An Adaptive Energy-Efficient Stream Decoding System for Cloud Multimedia Network on Multicore Architectures Chin-Feng Lai, Member, IEEE, Ying-Xun Lai, Ming-Shi Wang, and Jian-Wei Niu

Abstract—As the technology of applying cloud networks to cloud multimedia matures, the present demand for high-quality and diversified cloud multimedia can be met. Moreover, the prevalence of smartphones and wireless networks allows users to access network services at home and obtain multimedia content easily through mobile devices, achieving ubiquitous network cloud multimedia service. However, meeting users' demand for high-quality and diversified cloud multimedia on handheld devices, which have limited computing capability and power, is an interesting and challenging problem. This paper proposes an adaptive energy-efficient stream decoding system for cloud multimedia networks on multicore systems. The overall dynamic energy-efficient design is planned from a systematic viewpoint, and the temporary storage block for cloud multimedia is controlled with regard to the immediacy of cloud multimedia streaming transmission, without altering the original decoding process. The system decoding schedule is adjusted to reduce cloud multimedia data dependence through a dynamic-voltage-frequency-scaling mechanism. The adaptive energy-efficient stream decoding system is shown to be feasible for the cloud multimedia network.

Index Terms—Cloud multimedia network, energy efficient, multimedia sharing.

I. INTRODUCTION

Manuscript received July 30, 2012; revised December 21, 2012; accepted March 3, 2013. Date of publication October 3, 2013; date of current version February 5, 2014. This work was supported in part by the National Science Council and Science Park Administration of the Republic of China, Taiwan, under Contracts NSC 101-2628-E-194-003-MY3, 101-2221-E-197-008-MY3, 101-2219-E-197-004, and 101MG07-2, by the NSFC under Grants 61170296 and 60873241, and by the Program for New Century Excellent Talents in University under Grant 291184. C.-F. Lai is an assistant professor in the Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan (e-mail: [email protected]). Y.-X. Lai and M.-S. Wang are with the Department of Engineering Science, National Cheng Kung University, Tainan 701, Taiwan (e-mail: [email protected]; [email protected]). J.-W. Niu is a professor and Ph.D. advisor in the School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSYST.2013.2281629

As cloud network quality and hardware grades improve, cloud multimedia applications tend toward more massive data operations, and systems gradually adopt multicore architectures, which must consider the upper limits of system frequency and hardware design. Thus, a cloud multimedia system developer is confronted with many challenges.

1) Data dependence exists in numerous applications, particularly cloud multimedia applications. For example, the previous segment of data is usually referenced during encoding and decoding in the cloud multimedia codec process; however, unlike on a single-core platform, a multicore platform must consider how to distribute data evenly and effectively to different cores for processing in order to avoid data dependence.

2) In comparison to a single-core platform, a multicore platform requires a good power management mechanism to avoid excessive power consumption, particularly for systems such as handheld or battery-powered devices. Dynamic voltage frequency scaling (DVFS) provides a good opportunity here, as power consumption can be reduced in applications of lower computational complexity by dynamically lowering the system voltage or frequency. However, the system voltage or frequency for the executed application must be estimated without underestimation (otherwise, end users find the application service unusable or uncomfortable).

3) An important and realistic problem is overhead: developers and manufacturers must consider how long and how much a designer must change the present single-core system platform to implement the overall architecture.

In order to address the aforementioned problems, this paper proposes an energy-efficient cloud multimedia stream decoding system. The system considers the overall design architecture at the system level and implements parallel decoding with the front-wave concept without destroying the original architecture.
A simple but effective method is used: eliminate time slack and estimation errors within the end user's tolerance range by combining a buffer management mechanism, and regulate the voltage and frequency through a power-oriented online mechanism with offline consideration of parallel processing. The remainder of this paper is organized as follows. Section II reviews related works on DVFS and parallel architectures, as well as the decoding process of single-core architectures. Section III introduces the energy-efficient cloud multimedia stream decoding system proposed in this paper, including the overall architecture and module design. Section IV describes the implementation on an experiment platform, with power consumption estimations. Section V offers conclusions.

1932-8184 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

LAI et al.: STREAM DECODING SYSTEM FOR CLOUD MULTIMEDIA NETWORK

II. RELATED WORK

The related works are divided into three major types: studies of parallel encoding and decoding, studies of the DVFS system, and the cloud multimedia network, which is also reviewed in this section.

A. Parallel Decoding

Parallel processing refers to calculating different tasks at the same time, which is regarded as an effective method for rapidly shortening processing time and is gradually being applied to cloud multimedia processing. However, there is a data dependence problem in the parallel processing of image encoding and decoding. Taking H.264/AVC decoding as an example, an H.264/AVC picture can be an I frame, P frame, or B frame, where a P frame references the picture data of an I frame for decoding, and a B frame references the picture data of I and P frames; such a picture-reference decoding mode results in data collisions or waiting for data in a parallel architecture. Therefore, there are many studies of encoding and decoding on parallel architectures; if a complete decoded picture is used as the separation point, they can be classified into two major types: preprocessing parallel and internal processing parallel.

1) Front Wave Parallel: Front wave parallelism distributes decoded picture data in advance, before decoding calculation by the decoder, where each group of pictures (GOP) of a film segment is segmented [1], [2]. Each GOP is distributed to a processor for decoding, as each GOP can be decoded independently. Thus, the processing throughput can increase nearly linearly in this decoding mode [3]. However, this technology requires a very large memory space to store the decoded GOP segments.
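The GOP-level distribution described above can be illustrated with a minimal sketch; `assign_gops` is a hypothetical helper (not code from the paper), assuming a simple round-robin policy over independent GOPs.

```python
# Hedged sketch of GOP-level front-wave parallelism: because no data
# dependence crosses a GOP boundary, whole GOPs can be dealt round-robin
# to decoder cores and decoded concurrently.
def assign_gops(num_gops, num_cores):
    """Return, per core, the list of GOP indices that core decodes."""
    assignment = [[] for _ in range(num_cores)]
    for g in range(num_gops):
        assignment[g % num_cores].append(g)
    return assignment
```

The memory cost noted in the text follows directly: every in-flight GOP must be buffered in full, so memory grows with the number of cores.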
As GOP segmentation requires a large memory space, frame-based parallel encoding and decoding have been studied. This parallel decoding mode allocates each picture to a processor for calculation, where picture allocation is handled by pre-analytic arrangement or a competitive algorithm, in order to find two pictures without data dependence for concurrent operation. Flierl and Girod [4] proposed a B-frame parallel decoding mode, whose main concept was that a B frame is not referenced by other pictures and can therefore be allocated to different processors for decoding. However, in H.264 decoding, B frames can themselves be reference pictures, so the approach is inapplicable to H.264. Azevedo et al. [5] proposed a cross-frame decoding concept, where a 3-D-wave mode searches for decodable macroblocks across pictures being decoded, thus solving scalability and data dependence problems. Pang proposed a heuristic scheduling framework for scheduling GOPs to accelerate overall decoding. 2) Internal Processing Parallel: Whereas preprocessing architectures partition the data before it is processed by the system, internal processing parallelism hands the picture to the system,


which plans the parallel processing mode. The advantage of this partitioning is that a better parallel effect can be obtained through system-level scheduling; the defect is that the overall decoding structure requires modification. One study partitioned the H.264 stream by slice [6], as the slice is the minimum independent decoding unit in H.264. Partitioning the decoded picture by slice yields a good parallel effect; however, as a slice may cover anything from a few macroblocks to a complete frame, memory utilization, scalability, and load balance were not satisfactory. Roitzsch [7] proposed a slice-balancing algorithm that improves scalability; however, it targets the encoder end and is inapplicable to the decoder end. Jike et al. [8] scheduled the macroblocks of each picture to achieve parallelism. Van der Tol et al. [9] divided each picture decoding process into different tasks and allocated them to different decoders to form a parallel processing architecture. Cheng et al. [10] accelerated encoding and decoding through parallel instruction sets. B. Dynamic Voltage and Frequency Scaling Good power management is a necessary condition for numerous consumer electronics, particularly handheld or battery-based devices, and many power management mechanisms and designs have been proposed. Early static power management divided the system into active and sleep states, where normal voltage and frequency are supplied when the system is processing, and the system voltage and frequency are set to their minimum values to reduce energy consumption when the system is idle. Such a method is universally used in electronic equipment, with diversified states including prepare, idle, etc.
Such a static power management mode can save energy when the system is idle, but normal operational data processing cannot be improved. Therefore, dynamic power management mechanisms have been proposed, where the voltage or frequency is dynamically reduced according to the processed data volume so that the application program still completes before its deadline. The influence of voltage and frequency on system power follows the power consumption of a processor in complementary metal-oxide-semiconductor (CMOS) process technology:

$$P = C_{eff} \cdot V_{dd}^2 \cdot f \tag{1}$$

where $C_{eff}$ is the effective switched capacitance, $V_{dd}$ is the operating voltage, and $f$ is the operating frequency. The relationship of frequency and voltage can be expressed by the following equation:

$$f = K \cdot \frac{(V_{dd} - V_t)^a}{V_{dd}} \tag{2}$$

where $K$ is a constant, $V_t$ is the threshold voltage, and $a \in [1.2, 2]$ is the electron coefficient [11]. The time spent on executing a task is called the workload, defined as $T_{proc}$, which can be calculated by the following equation:

$$T_{proc} = C/f \tag{3}$$


where $C$ is the number of cycles the task spends in system calculation. Substituting (2) into (3) yields

$$T_{proc} = C \cdot \frac{V_{dd}}{K \cdot (V_{dd} - V_t)^a} \tag{4}$$

According to the energy equation

$$E = P \cdot T_{proc} \tag{5}$$

substituting (1) and (4) into (5) yields

$$E \propto C_{eff} \cdot V_{dd}^2 \tag{6}$$

According to the aforesaid equations, energy consumption can be reduced by lowering the voltage; by (6), the frequency determines the operating time but does not directly influence energy consumption, while voltage is correlated with frequency through (2). Therefore, the system operating time can be set and corrected by changing the frequency, and energy is saved by lowering the voltage accordingly.
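The model in (1)-(6) can be sketched numerically. The constants below ($C_{eff}$, $K$, $V_t$, $a$) are illustrative assumptions, not measured values from the paper.

```python
# Sketch of the CMOS power/energy model in (1)-(6).
# All constants are illustrative assumptions, not measured values.
C_EFF = 1e-9   # effective switched capacitance (F), assumed
K = 1e9        # proportionality constant in (2), assumed
V_T = 0.4      # threshold voltage (V), assumed
A = 1.5        # electron coefficient, a in [1.2, 2]

def frequency(vdd):
    """Eq. (2): f = K * (Vdd - Vt)^a / Vdd."""
    return K * (vdd - V_T) ** A / vdd

def power(vdd):
    """Eq. (1): P = Ceff * Vdd^2 * f."""
    return C_EFF * vdd ** 2 * frequency(vdd)

def exec_time(cycles, vdd):
    """Eq. (3): Tproc = C / f, with f from (2) as in (4)."""
    return cycles / frequency(vdd)

def energy(cycles, vdd):
    """Eq. (5): E = P * Tproc; f cancels, giving (6): E ∝ Ceff * Vdd^2."""
    return power(vdd) * exec_time(cycles, vdd)
```

Evaluating `energy` at a lower `vdd` shows the point of the section: lowering the voltage lengthens execution time but cuts energy quadratically, while the frequency term cancels out of the energy expression entirely.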

C. DVFS Related Work Previous work on regulating the voltage or frequency for energy saving is introduced hereinafter, divided into two parts: system correction and data prediction correction. Most studies focus on the DVFS system: when the workload of the executing task finishes before the deadline, the system sits idle for the remainder, which is called time slack. Most studies eliminate time slack by scheduling the instruction set. Lai et al. [12] used a buffer mechanism to adjust the deadline for a single task, multiple subtasks, and multitasks in order to eliminate time slack. Zhou et al. [13] designed scheduling for base-case and worst-case scenarios at different occurrence rates. Chen et al. [14] conducted stochastic modeling at the upper and lower levels of the MPEG-2 decoding system. Tsou et al. [15] scheduled dependent and independent tasks to eliminate time slack. Aside from internal system correction, some studies use a workload prediction mechanism, the main concept being that, for a cloud multimedia decoding system, the number of cycles for the next picture is unknown and must therefore be predicted, estimated, and corrected. In [16] and [17], energy savings are achieved by dynamic thermal management technology based on the mean time interval of each GOP. Pouwelse et al. [18] estimated each frame's decoding time from the time spent on the previous frame and the picture data size and, when the picture was half decoded, predicted the overall decoding time for a secondary correction from the elapsed time and the respective complexity of the two halves. The studies [19]-[21] made a first prediction of the system voltage and frequency based on the time spent on the previously decoded picture and its data volume and then corrected the system voltage and frequency according to different mechanisms.

Fig. 1. Adaptive energy-efficient stream decoding system architecture.

D. Cloud Multimedia Network Differing from the multimedia-aware cloud, the cloud-aware multimedia inclines toward the streaming service of front end cloud multimedia data. A cloud multimedia video file is usually massive, and in order to support the real-time video streaming service, cloud multimedia streaming separates the file into several packets. Thus, the client side can view the cloud multimedia content instantly in "play as received" mode, rather than after a general download. Cloud multimedia streaming usually contains several major elements: encoder, decoder, streaming server, and player. The streaming technology can be classified into two types: Web server and streaming server. The Web server is a general Web server that uses the Hypertext Transfer Protocol (HTTP) as the communication medium. Streaming in this mode is called HTTP streaming, which is convenient, as a special streaming server is not required; thus, it is also known as serverless streaming. Since HTTP uses the Transmission Control Protocol, a handshake must be completed before transmission, and retransmission is required if a data packet is lost, causing severe delays. Therefore, the HTTP live streaming protocol was proposed by Apple Inc., which provides a streaming service that tolerates interrupted packets. Bavier et al. [19] used the HTTP live streaming transmission mode for transcoding cloud multimedia files to the mobile device end and attained live streaming. III. SYSTEM ARCHITECTURE This section introduces the designed parallel DVFS decoding system in detail and offers design and modeling for cloud multimedia streaming processing under a parallel architecture and a DVFS mechanism. A. System Overview The system architecture proposed in this paper is shown in Fig. 1.
When cloud multimedia streaming data enter the system platform, the main processing unit (MPU) sequences the streaming data for parallel processing and conducts DVFS decoding prediction according to the sequenced film dependence and film format. On a multicore platform, the digital signal processor (DSP) system mostly handles encoding and decoding of the video part; thus, this paper discusses parallel decoding


Fig. 2. DVFS module function architecture.

on the platform of a single MPU with a multi-DSP core architecture. The MPU controls system parallel planning and DVFS prediction and setting, and the parallel DVFS processing can be completed without changing the DSP decoding process through this front end processing design. The multicore platform discussed here is also applicable to parallel decoding design on other platforms. B. Parallel DVFS on Stream Decoding According to previous studies, the two principal methods for reducing the energy consumption of a cloud multimedia decoding system are as follows: 1) eliminate time slack, and 2) predict the processing load required for the next picture and accurately set the voltage and frequency values. In order to attain these two goals, this study uses a simple and practical concept: front end and back end buffers implement the overall DVFS architecture. The front end buffer realizes the parallel mechanism and eliminates time slack, and the fixed decoding time of each picture is defined as a deadline. The difference is that the deadline schedule remains fixed, but a power-approaching mode is used to predict the system voltage and frequency values. Moreover, the system voltage and frequency are corrected according to the weights of the various tasks in the decoder. C. DVFS Module Implementation Architecture A mechanism combining offline and online stages is used in this paper to reduce the prediction error rate, with a brief introduction to the system as shown in Fig. 2. This study divided DVFS into two parts in order to implement the overall design architecture. The offline mechanism is statistically completed in the MPU part, while the voltage and frequency are dynamically regulated during the decoding process in the DSP part to implement the online mechanism.
The DVFS model determines the initial voltage and frequency values before the decoding process according to the format of the decoded picture, the size of the previously decoded picture, and its decoding time. The DSP end then dynamically regulates the voltage and frequency according to the time spent in the respective functions and the dependent data in the decoding process.
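The offline (MPU-side) stage can be illustrated with a hedged sketch. The level table and the size-scaling heuristic below are illustrative assumptions, not the paper's exact procedure; `offline_initial_level` is a hypothetical helper.

```python
# Hedged sketch of the offline DVFS initialization described above:
# estimate the next frame's decode time from the previous frame's time,
# scaled by relative frame size, then pick the slowest level expected to
# finish before the deadline. Level table values are assumed.
DVFS_LEVELS = [(0.9, 200e6), (1.0, 300e6), (1.1, 400e6), (1.2, 500e6)]  # (V, Hz)

def offline_initial_level(prev_decode_time, prev_frame_bytes, frame_bytes, deadline):
    """prev_decode_time is assumed to be measured at the fastest level."""
    est_time = prev_decode_time * frame_bytes / max(prev_frame_bytes, 1)
    for level, (v, f) in enumerate(DVFS_LEVELS):
        # Scale the fastest-level estimate by the frequency ratio of this level.
        scaled = est_time * DVFS_LEVELS[-1][1] / f
        if scaled <= deadline:
            return level  # slowest level that still meets the deadline
    return len(DVFS_LEVELS) - 1  # nothing meets it; fall back to the fastest
```

The online DSP-side stage would then adjust this initial guess during decoding, as the following subsections formalize.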

D. DVFS Algorithm for Independent Frame

When using system voltage and frequency to build an energy-saving system, the DVFS hardware module usually provides several sets of preset voltages and corresponding frequencies for programmers to control by software. Let $\upsilon = \{V_1, \ldots, V_{max}\}$ be the voltages available to the hardware and $F = \{f_1^{V_x}, \ldots, f_{max}^{V_x}\}$ the system frequencies available under voltage $V_x$. If $f_{max}^{V_x}$ is executed, the system obtains $T_{max}$ at the beginning, and according to (3), the workload can be determined by the system frequency and the number of cycles, in order to obtain

$$C = T_{max} \cdot f_{max} = T_n \cdot f_n, \quad T_n < T_{dead} < T_{n+1} \tag{7}$$

where $f_n$ is the selected frequency and $T_n$ is the corresponding workload. Many previous prediction criteria are based on the worst case. The ideal case is that the predicted workload exactly meets the applied deadline; however, $T_n = T_{dead}$ cannot be met perfectly at the frequencies provided by the system, and thus a time slack occurs. A previous study used a buffer mechanism to dynamically change the workload and solve this problem. This study refers to that repairing concept but considers that, when $T_{dead} - T_n$ is far larger than $T_{n+1} - T_{dead}$ in the previous algorithm, a system wait occurs in the parallel decoding of dependent data. Therefore, this study proposes a time trend prediction, different from the worst-case prediction principle, and searches for the frequency whose workload is closest to $T_{dead}$, i.e.,

$$|T_{dead} - T_n| < |T_{dead} - T_{n-1}| \;\text{ and }\; |T_{dead} - T_n| < |T_{dead} - T_{n+1}| \tag{8}$$
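The closest-to-deadline selection in (7)-(8) can be sketched directly; the frequency values in the usage example are assumed, not taken from the paper.

```python
# Sketch of the time-trend frequency choice in (7)-(8): among the available
# frequencies, pick the one whose workload T_n = C / f_n lies closest to the
# deadline, rather than the worst-case choice of the fastest safe level.
def pick_frequency(cycles, freqs, t_dead):
    """cycles: predicted cycle count C; freqs: available frequencies (Hz)."""
    return min(freqs, key=lambda f: abs(t_dead - cycles / f))
```

For example, with `cycles = 1e7`, levels `[200e6, 300e6, 400e6]`, and `t_dead = 0.03` s, the workloads are 0.05, 0.033, and 0.025 s; a worst-case rule would take 400 MHz (the fastest level finishing early), while (8) selects 300 MHz, trading a small overshoot (absorbed by the buffer) for lower energy.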

The estimation error can be corrected by the back end buffer (BEF) mechanism: the picture is temporarily stored in the buffer and displayed only when it is completely decoded. According to the size of the BEF, when $T$ is the display interval, the worst case is

$$T_{dead} - T_n > 0.5\,T \tag{9}$$

The difference in the predicted time interval will not exceed half of the interval $T$, provided that the system detects the estimation error. The next time interval can then be corrected as

$$T_{new} = (T_{dead} - T_n) + T \tag{10}$$
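The BEF correction in (9)-(10) amounts to folding the residual error into the next interval's deadline; a minimal sketch, with hypothetical helper names:

```python
# Sketch of the back-end-buffer (BEF) correction in (9)-(10): when the
# prediction error exceeds half a display interval, the residual
# Tdead - Tn is absorbed by the buffer and added to the next deadline.
def needs_correction(t_dead, t_n, t_interval):
    """Eq. (9): worst case, the error exceeds half an interval."""
    return (t_dead - t_n) > 0.5 * t_interval

def corrected_deadline(t_dead, t_n, t_interval):
    """Eq. (10): Tnew = (Tdead - Tn) + T, fed back into the search of (8)."""
    return (t_dead - t_n) + t_interval
```

The corrected `Tnew` then replaces `t_dead` in the frequency search of (8) for the next interval.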

$T_{new}$ is then substituted into (8) to repair the predicted system voltage and frequency for the next time interval. The above designs are based on the predicted cycle number, so their accuracy is limited. Therefore, this study makes corrections during the decoding process. Following Lai et al. [12], this study observes that ED must decode the entire picture, is clearly related to the decoding time and the compressed data size, and has no data dependence. First, the time spent on ED is recorded, and the prediction is estimated according to the average weight of ED. Referring to [18], the system voltage and frequency values are corrected again for accuracy according to the decoding time of the first half


Fig. 3. DVFS algorithm for independent frame.

and its relative weight when the picture is 50% decoded, as shown in Fig. 3.
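The 50% checkpoint correction (after [18]) can be sketched as rescaling the frequency from the observed first-half time. The weight parameter and helper name are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the mid-decode correction: at the 50% mark, re-estimate
# the total decode time from the elapsed time and the assumed share of work
# in the first half, then rescale the frequency so the remaining work still
# fits in the remaining time before the deadline.
def recorrect_frequency(f_current, elapsed, deadline, first_half_weight=0.5):
    new_total = elapsed / first_half_weight      # revised total at f_current
    remaining_work = new_total - elapsed         # time still needed at f_current
    remaining_time = deadline - elapsed
    if remaining_time <= 0:
        return None  # deadline already missed; caller picks the fastest level
    return f_current * remaining_work / remaining_time
```

If the first half ran slower than predicted, the returned frequency is higher than `f_current`; if it ran faster, the decoder can slow down and save energy.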

Fig. 4. DVFS algorithm for dependent frame.

Fig. 5. Implementation system architecture.

E. DVFS Algorithm for Dependent Frame This section considers how to set the corresponding voltage and frequency under the data dependence resulting from the immediacy required by the cloud multimedia streaming system. According to the equations in the previous section, the number of cycles the system requires changes under data-dependent decoding, which causes prediction inaccuracy. Therefore, by eliminating the uncertainty resulting from latency, an accurate prediction can be achieved. First, it is known that there is no data dependence in the ED decoding process, and the system begins decoding when the third row of the reference frame has been decoded:

$$T_{ED}^{dec} = T_{ED}^{ref} - T_{3C}^{ref} \tag{11}$$

where $T_{ED}^{dec}$ is the time spent on ED decoding, $T_{ED}^{ref}$ is the decoding time of the reference picture's ED, and $T_{3C}^{ref}$ is the decoding time of the first three rows of the reference picture. $T_{ED}^{ref}$ and $T_{3C}^{ref}$ can be determined from the system average weights, according to

$$T_{ED}^{ref} = T \cdot \frac{W_{ED}}{W_{dec}} \tag{12}$$

$$T_{3C}^{ref} = \left(T - T_{ED}^{ref}\right) \cdot \frac{3}{C_{total}} \tag{13}$$

where $C_{total}$ is the total number of rows in the picture. The voltage and frequency required for the decoded picture in the ED process can thus be determined, and data dependence can be corrected in ideal decoding cases. However, the decoding speed varies with picture format in actual decoding; thus, inter-DSP communication is used to record the number of decoded rows of the reference picture and the system voltage and frequency settings. The following rule keeps the reference picture at least three decoded rows ahead of the decoded picture, thus solving the system wait problem, as shown in Fig. 4:

$$f_n^{dec} = \begin{cases} f_n, & C_n^{ref} - C_n^{dec} \ge 3 \\ f_{n-1}, & C_n^{ref} - C_n^{dec} < 3 \end{cases} \tag{14}$$

IV. IMPLEMENTATION AND ANALYSIS

In order to validate our system structure, this study implemented a dual-DSP system on a parallel architecture core (PAC) Duo platform to validate the prediction mechanism. PAC Duo

is a heterogeneous multicore system on chip (SoC), which consists of one ARM926EJ as the main processor and two parallel architecture core digital signal processors (PACDSPs), developed by the SoC Technology Center, Industrial Technology Research Institute (STC/ITRI), Taiwan. The PACDSP is a 32-bit fixed-point DSP with a five-way very long instruction word pipeline, 32 kB of instruction memory, and 64 kB of local data memory. The improvement of system efficiency was tested on the parallel architecture, and the power consumption was measured and compared with the worst-case prediction mode to discuss system power consumption. A. Environment Description This study adopted a PAC Duo platform and integrated the proposed power sensing system into Android OpenCORE. The implementation system structure and operation flow are shown in Fig. 5. The upper application (an Android package) calls the OpenCORE multimedia framework to play video when the system is in operation, and OpenCORE is then in charge of coprocessing with the DSP. The DVFS predictor predicts and selects the appropriate DVFS level according to the decoding load. Then, input/output control transfers instructions to the DSP PM driver, thus controlling the DSP voltage and frequency for

Fig. 6. Relation between frame size and decoding time.

Fig. 7. Energy consumption for different bit rates.

Fig. 8. Deadline miss and energy consumption.

the DVFS controller to attain the goal of dynamic adjustment. Finally, OpenCORE initiates the DSP for image decoding. B. Relation Between Decoding Time and Film Size This study used common test bit streams to obtain the relation between frame size and decoding time, as shown in Fig. 6. All bit streams were 30 fps with common intermediate format (352 × 288) resolution and 300 frames each. C. Influence of Bit Rate on Energy Consumption Different bit rates influence the complexity of decoding to some extent: with a higher bit rate, the frame size is larger, and the decoding time is longer. This study tested the energy consumption at bit rates of 200, 400, and 600 kb/s, as shown in Fig. 7. As expected, a higher bit rate consumes more power; however, our prediction model can still reduce the energy consumption by 36.2%-41.9% at the three bit rates. D. Analysis of Deadline Miss If a frame fails to be decoded before the deadline, a deadline miss occurs. The deadline miss ratio can be regarded as an indicator for adjusting the DVFS algorithm, as shown in Fig. 8. There

Fig. 9. Deadline miss distribution.

are varying degrees of deadline misses when the proposed prediction model is used. It should be noted that a higher deadline miss ratio does not directly indicate inaccuracy of the prediction model: when the prediction model overrates the DSP load, the deadline miss ratio is low, but less electric energy is saved. Taking the news bit stream as an example, its deadline miss ratio is the lowest, while its power consumption is the highest. In addition, statistics on the extent of the deadline miss prediction error are shown in Fig. 9, where it is


Fig. 10. Computing count distribution.

observed that, aside from news, for which the prediction model is inaccurate, the errors of the other bit streams are mostly less than 5%, with no significant influence on actual playback. E. Analysis of Decoding Time Finally, the decoding time of each frame, after using the DVFS mechanism, is divided by the preset deadline to determine the distribution, as shown in Fig. 10. The more the distribution concentrates near 100%, the more accurate the prediction mechanism. In this test, the distributions of the five bit streams other than news are mostly close to 100% and its adjacent interval, suggesting that the prediction mechanism works normally. The news prediction is inaccurate because the timing difference between the lowest two DVFS levels is large: when the lowest DVFS level just exceeds the deadline, only a markedly higher level can be selected; thus, the decoding time is much shorter than the deadline, producing this distribution. V. CONCLUSION This paper has proposed an adaptive energy-efficient stream decoding system for cloud multimedia networks on multicore architectures, combining multicore scheduling and a DVFS mechanism to provide a high-efficiency, low-power cloud multimedia decoding mechanism. DVFS reduces system power consumption, and the scheduling and correction calculations solve cloud multimedia data dependence. The mechanism was implemented in an Android system, and its efficiency on the platform was analyzed. The experimental results showed that, in comparison to general systems, this system can reduce system power consumption by 36.2%-41.9%. In the future, it can be applied to network bandwidth and inter-system DVFS calculation or combined with the cloud system concept to attain high efficiency and low power consumption.

REFERENCES

[1] T. Olivares, F. J. Quiles, P. Cuenca, L. Orozco-Barbosa, and I. Ahmad, "Study of data distribution techniques for the implementation of an MPEG-2 video encoder," in Proc. Parallel Distrib. Comput. Syst., Nov. 3-6, 1999, pp. 537-542.
[2] A. Bilas, J. Fritts, and J. Singh, "Real time parallel MPEG-2 decoding in software," Comput. Sci. Dept., Princeton Univ., Princeton, NJ, USA, Tech. Rep. TR-516-96, 1996.
[3] D. Farin, N. Mache, and H. N. Peter, "SAMPEG: A scene adaptive parallel MPEG-2 software encoder," in Proc. SPIE Visual Commun. Image Process., 2001, pp. 272-283.
[4] M. Flierl and B. Girod, "Generalized B pictures and the draft H.264/AVC video-compression standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 587-597, Jul. 2003.
[5] A. Azevedo, C. Meenderinck, B. Juurlink, A. Terechko, J. Hoogerbrugge, M. Alvarez, and A. Ramirez, "Parallel H.264 decoding on an embedded multicore processor," in Proc. 4th Int. Conf. HiPEAC, Jan. 2009, pp. 404-418.
[6] A. Rodriguez, A. Gonzalez, and M. P. Malumbres, "Hierarchical parallelization of an H.264/AVC video encoder," in Proc. Int. Symp. Parallel Comput. Elect. Eng., 2006, pp. 363-368.
[7] M. Roitzsch, "Slice-balancing H.264 video encoding for improved scalability of multicore decoding," in Proc. 27th IEEE RTSS Work-in-Progress, 2006, pp. 77-80.
[8] C. Jike, N. Satish, B. Catanzaro, K. Ravindran, and K. Keutzer, "Efficient parallelization of H.264 decoding with macro block level scheduling," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2007, pp. 1874-1877.
[9] E. B. van der Tol, E. G. T. Jaspers, and R. H. Gelderblom, "Mapping of H.264 decoding on a multiprocessor architecture," in Proc. 6th Int. Conf. Adv. Mobile Comput. Multimedia, Nov. 2008, pp. 40-49.
[10] R.-S. Cheng, C.-H. Lin, J.-L. Chen, and H.-C. Chao, "Improving transmission quality of MPEG video stream by SCTP multi-streaming and differential RED," J. Supercomput., vol. 62, no. 1, pp. 68-83, Oct. 2012.
[11] Y. Zhang, X. S. Hu, and D. Z. Chen, "Task scheduling and voltage selection for energy minimization," in Proc. 39th DAC, 2002, pp. 183-188.
[12] Y.-X. Lai, C.-F. Lai, C.-C. Hu, H.-C. Chao, and Y.-M. Huang, "A personalized mobile IPTV system with seamless video reconstruction algorithm in cloud networks," Int. J. Commun. Syst., vol. 24, no. 10, pp. 1375-1387, Oct. 2011.
[13] L. Zhou, M. Chen, Z. Yu, J. Rodrigues, and H.-C. Chao, "Cross-layer wireless video adaptation: Tradeoff between distortion and delay," Comput. Commun., vol. 33, no. 14, pp. 1615-1622, Sep. 2010.
[14] W.-M. Chen, C.-J. Lai, H.-C. Wang, H.-C. Chao, and C.-H. Lo, "H.264 video watermarking with secret image sharing," IET Image Process., vol. 5, no. 4, pp. 349-354, Jul. 2011.
[15] P.-C. Tsou, Y.-H. Lin, H.-H. Cho, J.-M. Chang, and H.-C. Chao, "Implement of an efficient system to reduce power consumption," in Proc. 13th ICACT, Phoenix Park, Korea, Feb. 13-16, 2011, pp. 1409-1413.

LAI et al.: STREAM DECODING SYSTEM FOR CLOUD MULTIMEDIA NETWORK

[16] L. Wonbok, K. Patel, and M. Pedram, “GOP-level dynamic thermal management in MPEG-2 decoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 6, pp. 662–672, Jun. 2008. [17] Y. Inchoon, K. L. Heung, J. K. Eun, and H. Y. Ki, “Effective dynamic thermal management for MPEG-4 decoding,” in Proc. Int. Conf. Comput. Des., Oct. 2007, pp. 623–628. [18] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic voltage scaling on a low-power microprocessor,” in Proc. 7th Annu. Int. Conf. Mobile Comput. Netw., Jul. 2001, pp. 251–259. [19] A. C. Bavier, A. B. Montz, and L. L. Peterson, “Predicting MPEG execution times,” in Proc. ACM SIGMETRICS Joint Int. Conf. Meas. Modeling Comput. Syst., 1998, pp. 131–140. [20] D. Son, C. Yu, and H. Kim, “Dynamic voltage scaling on MPEG decoding,” in Proc. ICPADS, 2001, pp. 633–640. [21] K. Choi, K. Dantu, W.-C. Cheng, and M. Pedram, “Frame-based dynamic voltage and frequency scaling for a MPEG decoder,” in Proc. IEEE/ACM ICCAD, 2002, pp. 732–737.

Chin-Feng Lai (M’09) received the Ph.D. degree from the Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan, in 2008. Since 2013, he has been an Assistant Professor in the Department of Computer Science and Information Engineering, National Chung Cheng University. His research interests include multimedia communications, sensor-based healthcare, and embedded systems. In the four years since receiving the Ph.D. degree, he has authored or coauthored over 80 refereed papers in journals, conference, and workshop proceedings in these areas, and he is currently working to publish his latest research in the IEEE Transactions on Multimedia and the IEEE Transactions on Circuits and Systems for Video Technology. Dr. Lai is a member of the IEEE Circuits and Systems Society and the IEEE Communications Society.

Ying-Xun Lai received the M.S. degree in electrical engineering from National Sun Yat-sen University, Kaohsiung, Taiwan, in 2008. He is currently working toward the Ph.D. degree in engineering science at National Cheng Kung University, Tainan, Taiwan. His main research interests are embedded systems and home digital multimedia services.


Ming-Shi Wang received the B.S. degree in electronics engineering from Feng Chia University, Taichung, Taiwan, in 1977, the M.S. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1982, and the Ph.D. degree in computation from the University of Manchester Institute of Science and Technology, Manchester, U.K., in 1992. He is currently an Associate Professor in the Department of Engineering Science and the Director of the Division of Teaching and Research, Computer and Network Center, both at National Cheng Kung University. His major research interests are digital image processing, computer vision, computer networks, and advanced machine learning.

Jian-Wei Niu received the Ph.D. degree in computer science from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 2002. He was a Visiting Scholar at the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, from January 2010 to February 2011. He has published more than 90 refereed papers in venues such as the International Conference on Computer Communications, ACM Transactions on Embedded Computing Systems (TECS), and the Journal of Parallel and Distributed Computing, and has filed more than 30 patents in mobile and pervasive computing. He has served as Associate Editor-in-Chief of the International Journal of Ad Hoc and Ubiquitous Computing, Associate Editor-in-Chief of the Journal of Internet Technology, and Editor of the Journal of Network and Computer Applications (Elsevier). His research is sponsored by the NSFC, the National 863 Plan of China, Nokia, and other funds. His current research interests include mobile and pervasive computing and mobile video analysis. Dr. Niu served as Program Cochair of the IEEE Symposium on Embedded Computing 2008, Vice-Chair of the International Conference on Cyber, Physical and Social Computing (CPSCom) 2013, and as a Technical Program Committee member of the International Conference on Communications, the Wireless Communications and Networking Conference, the Global Communications Conference, the International Wireless Communications and Mobile Computing Conference, the China Wireless Sensor Network Conference (CWSN), and others. He was the recipient of the New Century Excellent Researcher Award from the Ministry of Education of China in 2009, an Innovation Award from the Nokia Research Center, and best paper awards at CWSN 2012 and the 2010 IEEE International Conference on Green Computing and Communications (GreenCom 2010).