Minimizing Embedded Software Power Consumption Through Reduction of Data Memory Access

Shan Li, Edmund M-K. Lai
School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, SINGAPORE 639798

Mohammed Javed Absar
STMicroelectronics Asia Pacific Pte. Ltd.
20 Science Park Road, SINGAPORE 117674

Abstract

Software applications that involve multimedia signal processing typically have to process large amounts of data, often handling data arrays inside nested loops. Experiments show that, for this kind of application, data transfer (memory access) operations consume much more power than data-path operations. Our objective is to reduce memory-access-related power consumption by using source program transformations to reduce the number of data transfers between the processor and memory, or between a higher level of memory (closer to the processor) and a lower level. The procedure involves profiling, inlining and global transformation. Its effectiveness is illustrated by applying it to the software for a wideband adaptive multi-rate (WB-AMR) speech decoder, which can be obtained from the official website of the 3rd Generation Partnership Project (3GPP).

1. Introduction

Power has become an increasingly important cost factor in embedded system design. One reason is that portable hand-held devices require long battery life. Moreover, high power also means costly packaging and cooling requirements, and lower reliability. Consequently, power-efficient design has become a critical concern when designing an embedded system. Power consumption of an embedded application is a complex function of several components. The problem can be analyzed at several levels, and power-reduction strategies can be devised that take one or more of these levels into account [7]. Low-level techniques such as voltage scaling, clock frequency scaling, clock gating [8] and pipeline gating [9] are known to be very useful. At the application level, source code transformation using optimizing or restructuring compilers has been gaining prominence. The advantage of optimizing compilers lies in the fact that they are not application specific. However, different classes of applications and different kinds of target platforms require slightly different types of optimization. A scientific application, which contains large amounts of computation, needs an optimizing compiler that can parallelize or vectorize the application (e.g. compilers for the Cray-T4). Applications running on networked computers require optimization for minimum message passing. Embedded systems, especially those running multimedia applications, require methods that optimize data transfers. This is the focus of this paper.

Compiler optimizations can broadly be divided into two levels. At the lower level, the translation is more hardware-dependent, which limits the compiler to a certain class of processors. Higher-level compiler optimization techniques perform source-to-source transformations: typically, loop transformations and data transformations are applied to the source code, with the output in the same language as the input. Considerable research has been done on loop and data transformation for data locality and increased parallelism; a good example is the decade-long SUIF project [17] at Stanford University. The main objective of our work, however, is to apply these high-level compiler techniques to the reduction of embedded software power.

Our specific target for power reduction is multimedia applications, which are data-dominated. Applications of this kind access large data streams represented by arrays inside (nested) loops, and their power consumption is dominated by data transfer and memory access operations [1,2,3]. In contrast, data-path operations consume much less power. Hence, a method is needed to reduce the power consumption of the data memory.

Our solution is to minimize data transfer and memory access operations in multimedia applications, which is equivalent to minimizing array accesses inside (nested) loops. We achieve this by performing high-level (source-to-source) transformations on arrays inside loops, such that data accessed multiple times can be kept in registers and redundant memory accesses are minimized. The procedure involves profiling, inlining and global transformation, described in more detail in Section 2. These steps are applied to the WB-AMR speech decoder application, and their effectiveness in power
reduction is discussed in Section 3, followed by the conclusions.

2. Approach

2.1 Profiling

Starting from the source code of the application, which is currently restricted to the "C" programming language, a profile of the real-time execution of the program is obtained. The profile first records the number of data array reads and writes for the various parts of the program; this counting is done by means of a software profiling tool. From the profile, areas of heavy data array access (referred to as bottlenecks below) and the call paths of these arrays are identified. These bottlenecks are the candidates for later transformations.

2.2 Inlining

Function inlining replaces a call to a function or subroutine with the body of that function or subroutine. In procedural programming, functions provide clarity and ease of debugging. As far as program optimization is concerned, however, their drawback is that they may act as brick walls between sections of code. The aim of inlining is to remove these brick walls and enlarge the exploration space for optimization. At the end of this step, an inlined version of the source code, guided by the profile, is obtained. In the experiment, a call-based inlining scheme is adopted: the decision to inline is made independently at each call site (the point where a function call is made). With the bottlenecks identified in the previous step, the call sites at these bottlenecks are the candidates for inlining. The criterion for inlining is that it should facilitate the global transformation step within an acceptable code size. For example, if both the calling function and the called function access a common array redundantly, the called function can be inlined into the calling function to enable memory-access-reduction transformations. The main drawback of inlining is that it increases the code size; it is demonstrated in [5] that maximizing the reduction in function calls under a code size constraint is NP-hard. In the experiment, only the bottleneck functions are evaluated for inlining.

2.3 Global Transformation

To obtain improvements in memory access reduction, loop transformations are applied to the inlined version of the source code. This transformation process is currently performed manually. Memory access is reduced if the number of load and store instructions is reduced. This can be achieved by increasing the exploitation of array data reuse in space (spatial reuse) and in time (temporal reuse). Using multiple data points of a cache line before the line is replaced with another line is an example of spatial reuse. Temporal reuse means multiple uses of the same data close together in time.

Reuse is inherent in the computation and does not depend on the particular way the loops are written [13]. Loop transformation aims to obtain better data locality and to increase the exploitation of reuse. For example, suppose an array element is written and then read some iterations later; there is reuse of this element. If the write and the read are too far apart in time, the data has to be stored to memory after the write and loaded again before the read, which is poor locality. If a loop transformation can bring the read and the write closer together (better locality), the data can be kept in a register. In this example, memory access is reduced by exploiting temporal reuse.

In the experiment, different types of loop transformations are applied to increase reuse and register usage. The feasibility and extent of a transformation are constrained by data dependences, the number of registers available and the increase in code size. Experiments show that loop merging and loop unrolling, together with scalar replacement, are the most effective transformations for reducing memory access. The rest of this section uses the filtering operation as an example and discusses how unrolling and scalar replacement are applied to it to reduce memory access.

Filtering is one of the most common signal processing operations, so it is helpful to have a method to minimize the number of memory accesses for this frequently used operation. As an example, we consider the forward prediction:

y[n] = ∑_{i=0}^{D−1} h[i] x[n−i] = h[0]x[n] + h[1]x[n−1] + ... + h[D−1]x[n−D+1],   n = 1, ..., F

Here D is the filter order and F is the frame length. y[n] is the output signal, computed from D samples of the input signal x[n], and h[i] is the prediction parameter. The "C" code for this filter operation, shown in Figure 2(a), consists of a double loop. During the unrolling process, unrolling factors are chosen for the outer loop and the inner loop, denoted u1 and u2 respectively. The statement S in the inner loop is unrolled into u1 * u2 statements. The unrolled code, Figure 2(b), reveals the redundant accesses to the array x, illustrated by the dotted diagonals in Figure 2(c): on each dotted line the same array element is accessed, and the number of accesses is the number of black nodes on the diagonal (each black node represents one iteration). These redundant accesses are removed by scalar replacement, which requires 2 * u1 + u2 additional registers (assuming scalars are stored in registers). The choices of u1 and u2 depend on the number of registers available and the constraints on the size of the unrolled code. For this example, both are set to 3. The memory accesses for the arrays are then reduced to a fraction of the original number: (u1 + u2 − 1) / (u1 * u2) = 5/9 for x and 1/u1 = 1/3 for h.
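The two fractions can be checked with a per-tile tally; the short derivation below is our addition, using the unrolling factors u1 and u2 defined above:

```latex
% One unrolled u_1 x u_2 tile performs u_1 u_2 multiply-accumulate operations.
% Distinct x elements read in a tile: x[i+j], ..., x[i+j+u_1+u_2-2],
% i.e. u_1+u_2-1 values; distinct h elements: h[j], ..., h[j+u_2-1], i.e. u_2 values.
\frac{x \text{ loads}}{\text{MACs}}
  = \frac{u_1 + u_2 - 1}{u_1 u_2} \Big|_{u_1 = u_2 = 3} = \frac{5}{9},
\qquad
\frac{h \text{ loads}}{\text{MACs}}
  = \frac{u_2}{u_1 u_2} = \frac{1}{u_1} \Big|_{u_1 = 3} = \frac{1}{3}.
```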

//Array y starts from the last D-1
//elements of the previous frame.
//Array x starts from the new frame.
for (i = 0; i < F; i++) {
    sum = 0;
    for (j = 0; j < D; j++)
S:      sum = sum + x[i+j] * h[j];
    y[i] = sum;
}

(a): Original "C" code of a prediction filter

c1 = F - F % u1;
c2 = D - D % u2;
for (i = 0; i < c1; i += u1) {
    sum1 = 0;
    .....
    sum(u1-1) = 0;
    for (j = 0; j < c2; j += u2) {
        sum1 = sum1 + x[i+j] * h[j];
        sum1 = sum1 + x[i+j+1] * h[j+1];
        ....
        sum1 = sum1 + x[i+j+u2-1] * h[j+u2-1];
        sum2 = sum2 + x[i+j+1] * h[j];
        sum2 = sum2 + x[i+j+2] * h[j+1];
        ....
        sum2 = sum2 + x[i+j+u2] * h[j+u2-1];
        ....
        sum(u1-1) = sum(u1-1) + x[i+j+u1-1] * h[j];
        sum(u1-1) = sum(u1-1) + x[i+j+u1] * h[j+1];
        ....
        sum(u1-1) = sum(u1-1) + x[i+j+u1+u2-2] * h[j+u2-1];
    }
    for (j = c2; j < D; j++) {
        sum1 = sum1 + x[i+j] * h[j];
        ....
        sum(u1-1) = sum(u1-1) + x[i+j+u1-1] * h[j];
    }
    y[i] = sum1;
    ...
    y[i+u1-1] = sum(u1-1);
}
for (i = c1; i < F; i++) {
    sum1 = 0;
    for (j = 0; j < D; j++)
        sum1 = sum1 + x[i+j] * h[j];
    y[i] = sum1;
}

(b): After unrolling

(c): Illustration of u1 and u2 (diagram omitted)

Figure 2 Loop unrolling followed by scalar replacement

Although loop unrolling increases the code size, it is effective for power reduction, because it reduces the loop overhead and increases the amount of computation per iteration. From the power point of view, fewer instructions mean less power dissipation [6]. Together with scalar replacement, power consumption is further reduced through fewer memory accesses.

Scalar replacement increases register usage, which is the most effective way of reducing memory operands [10]. A register operand also gives shorter running times, since potential stalls and cache misses are eliminated. The example above shows how loop transformations can be applied to increase register usage. A very important constraint on loop transformation is that it must not violate the data dependences of the original source code [12]. When applying loop transformations such as interchange, skewing and reversal, the data dependences must always be preserved.

DO i=1,3
  DO j=1,3
    X[i,j] = X[i-1,j] + X[i,j-1]

Figure 3 Simple code for transformation

In the code above (Figure 3), the distance vectors for array X are {(1, 0), (0, 1)}. All the distance vectors are positive. After any transformation, the data dependences should still be positive; that is, the polarity of the distance vectors must remain constant, and only their absolute values may change. For illustration, loop skewing is applied to this piece of code to improve data locality. The transformed code is shown in Figure 4.

DO i=1,3
  DO j=i+1,i+3
    X[i,j-i] = X[i-1,j-i] + X[i,j-i-1]

Figure 4 Code after loop skewing


The codes in Figure 3 and Figure 4 are equivalent in terms of the final result. This is shown in Figure 5.

Figure 5 Data dependence before and after loop skewing (graphs omitted)

The graph on the left-hand side shows the data dependence of the original code, and the one on the right-hand side shows the new data dependence of the transformed code. The new distance vectors after loop skewing are {(1, 1), (0, 1)}, which are positive vectors. Hence, the data dependences are said to be preserved.

3. Example

We applied the procedure to the entire WB-AMR speech decoder program [4], which has about 15,000 lines of code. This codec has nine transmission modes. Table 1 shows the number of memory accesses (reads and writes) for all nine modes of operation, before and after the optimization.

Table 1 Reduction in memory accesses for different modes

Mode  Org. Total Accesses  Opt. Total Accesses  Reduction (%)
0     25021926             18140790             27.5
1     23330696             16763844             28.1
2     22436274             16066076             28.4
3     22465554             16089769             28.4
4     22493862             16113250             28.4
5     22536148             16155028             28.3
6     22561164             16177250             28.3
7     22617641             16229155             28.2
8     26815942             18339204             31.6

The results show that the percentage reduction in memory accesses for the nine modes is between 27.5% and 31.6%.

4. Conclusions

In this paper, we proposed a novel solution to the problem of power reduction for embedded software. The specific target of the research is multimedia applications, for which data memory access is a major contributor to system power. The approach performs power reduction by minimizing the number of memory accesses through high-level compiler transformations on the source program. It involves three steps: profiling, inlining and global transformation. The effectiveness of this approach is demonstrated by applying it to a practical application, the WB-AMR speech decoder. On-going work includes the incorporation of these procedures into a compiler.

Acknowledgements

The authors would like to thank the Centre for Multimedia and Network Technology in the School of Computer Engineering, Nanyang Technological University, and the Audio DSP Systems Group of STMicroelectronics Asia Pacific Pte Ltd for the support provided for this research work.

References

[1] F. Catthoor, S. Wuytack, E. D. Greef, F. Balasa, L. Nachtergaele and A. Vandecappelle, Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Boston: Kluwer Academic Publishers, 1998.
[2] V. Tiwari, S. Malik and A. Wolfe, "Power analysis of embedded software: a first step towards software power minimization," Proceedings of the IEEE Conference on Computer-Aided Design, Santa Clara, CA, pp. 384-390, November 1994.
[3] R. Gonzalez and M. Horowitz, "Energy dissipation in general-purpose microprocessors," IEEE Journal of Solid-State Circuits, Vol. 31, No. 9, pp. 1277-1283, September 1996.
[4] The 3rd Generation Partnership Project (3GPP), http://www.3gpp.org.
[5] R. W. Scheifler, "An analysis of inline substitution for a structured programming language," Communications of the ACM, Vol. 20, No. 9, pp. 647-654, 1977.
[6] K. R. Wadleigh and I. L. Crawford, Software Optimization for High-Performance Computing. Hewlett-Packard Company / Prentice Hall PTR, 2000.
[7] R. Graybill and R. Melhem (eds.), Power Aware Computing. New York: Kluwer Academic / Plenum Publishers, 2002.
[8] R. Graybill and R. Melhem (eds.), Power Aware Computing. New York: Kluwer Academic / Plenum Publishers, 2002, Chap. 1, pp. 4-5.
[9] R. Graybill and R. Melhem (eds.), Power Aware Computing. New York: Kluwer Academic / Plenum Publishers, 2002, Chap. 4, pp. 61-66.
[10] V. Tiwari, S. Malik and A. Wolfe, "Compilation techniques for low energy: an overview," IEEE Symposium on Low Power Electronics, 1994.
[11] V. Sarkar, "Automatic selection of high-order transformations in the IBM XL Fortran compilers," IBM Journal of Research and Development, Vol. 41, No. 3, pp. 233-264, May 1997.
[12] U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, 1993.
[13] M. E. Wolf and M. S. Lam, "A data locality optimizing algorithm," ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1991.
[14] A. W. Lim and M. S. Lam, "Maximizing parallelism and minimizing synchronization with affine partitions," Parallel Computing, Vol. 24, No. 3-4, pp. 445-475, May 1998.
[15] V. Sarkar and R. Thekkath, "A general framework for iteration-reordering loop transformations: technical summary," ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 175-187, June 1992.
[16] V. Loechner, B. Meister and P. Clauss, "Precise data locality optimization of nested loops," The Journal of Supercomputing, Vol. 21, pp. 37-76, January 2002.
[17] M. S. Lam, "A retrospective: a data locality optimizing algorithm," in 20 Years of Programming Language Design and Implementation (1979-1999): A Selection, June 2003.