An ILP Based Approach to Address Code Generation for Digital Signal Processors ∗

O. Ozturk, Dept. CSE, Penn State, University Park, PA 16802

M. Kandemir, Dept. CSE, Penn State, University Park, PA 16802

S. Tosun, Dept. EECS, Syracuse Univ., Syracuse, NY 13244

ABSTRACT One of the most important problems in resource-constrained embedded systems is limited memory space for code and data. This paper targets DSP-based architectures and proposes an ILP (integer linear programming) based approach for reducing code memory space requirements by exploiting the auto-increment and auto-decrement addressing modes provided by DSPs. Specifically, we address the problem of effective use of address registers, demonstrate how we can take advantage of additional capabilities that exist in some recent DSPs (such as modify registers), and discuss how our ILP-based solution can be used for performing tradeoffs between code memory and data memory space requirements. We also compare our approach to a previously proposed heuristic solution. Our experimental analysis using several applications indicates that the proposed ILP-based approach is very effective in reducing both code memory demand and execution cycles, and that its solution times are within tolerable limits.

Categories and Subject Descriptors D.3.4 [Processors]: Code generation; C.3 [Special-Purpose and Application-Based Systems]: Signal processing systems

General Terms Algorithms, Design, Experimentation

Keywords Code generation, DSP, Integer Linear Programming

1. INTRODUCTION

∗ This work is supported in part by NSF Career Award #0093082 and a grant from GSRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’06, April 30–May 2, 2006, Philadelphia, PA, USA. Copyright 2006 ACM 1-59593-347-6/06/0004 ...$5.00.

Reducing code space requirements of embedded applications is a critical problem as many embedded systems have tight memory space constraints. While it is true that computer architects and circuit designers are able to fit increasingly large amounts of memory in the same area, the increase in application sizes and complexity far exceeds the increase in effective memory capacity of embedded systems. Driven by these observations, recent research has proposed several approaches to reducing code space requirements. The effective utilization of code memory is even more pressing on multiprocessor-on-a-chip architectures, where multiple cores compete for the same memory space.

In principle, code memory occupation can be reduced in two ways. First, one can employ a compression-based approach, where some parts of the application code are kept in compressed form in code memory and are decompressed only when necessary. The prior efforts along this direction considered both hardware-assisted techniques and software-driven approaches. The second alternative is to be careful in generating code so that the growth in code space demand can be kept under control. An example of this approach is address code generation for DSP architectures. Most DSPs today include an address generation unit (AGU) which can operate in parallel with the main execution unit(s). The AGU is typically equipped with a set of address registers that keep the addresses of the data elements accessed. The important point here is that, using auto-increment and auto-decrement addressing modes, an instruction can perform an address calculation in parallel with instruction execution. This brings two main benefits. First, it increases on-chip parallelism, thereby reducing application execution time. Second, and more importantly from the memory space viewpoint, it helps reduce code size, since using auto-increment/auto-decrement modes eliminates explicit loads to address registers in the AGU.

The prior research in exploiting these modes for reducing code memory occupation considered both data layout assignment to variables (also called the offset assignment problem [8]) and computation re-ordering [12]. A variant of the address register allocation problem is studied by Liem et al. [10]. In their work, the number of pointers needed to represent array references is minimized.
References [2] and [9] are examples of prior studies that focused on memory layout for scalar DSP codes. In [1], Araujo et al. present a technique that allocates address registers to array accesses based on path covering and graph coloring. In this paper, we focus on a DSP-based environment and propose an ILP (integer linear programming) based approach to the layout assignment problem for increasing the use of auto-increment and auto-decrement addressing modes during code generation. We address the problem of effective use of address registers, demonstrate how we can take advantage of additional capabilities that exist in some recent DSPs (such as modify registers), and discuss how our ILP-based solution can be used for performing tradeoffs between code memory and data memory space requirements. We also compare our approach to a previously proposed heuristic solution.

Note that many embedded systems employ DSP-based infrastructures. Examples include application areas such as radar and sonar systems, wireless communications and software radio, imaging and multimedia, surveillance and security, and medical imaging and instrumentation. Many of these application areas integrate the power of DSP-based processing with the very latest analogue to digital converter technology, supporting a range of protocols such as Gigabit Ethernet, Infiniband and Fibre Xtreme. Therefore, the code space reduction strategy proposed in this paper will be beneficial to a large class of embedded applications.

The next section discusses related work. Section 3 explains how the AGU works in DSPs. Section 4 discusses the details of our ILP-based approach. An experimental evaluation of the proposed approach and its comparison with the previous work are presented in Section 5. Finally, Section 6 presents our concluding remarks.

Table 1: Address register assignment optimization dimensions.

Code Domain    Restructuring: Data Layout    Restructuring: Access Pattern
Array          (I)                           (II)
Scalar         (III)                         (IV)

2. DISCUSSION OF RELATED WORK

Numerous DSP-related optimization techniques, including data layout optimization techniques and code compression schemes [5], [4], [13], have been studied in the past. In this section, we focus only on the related work on address register optimizations. Within the address register optimization domain, the past research can be roughly classified along two dimensions as shown in Table 1. In the array-based code domain, most of the prior work on optimizing AGU usage concentrates on allocating address registers for a given code. Consequently, it is not possible to categorize these studies in group (I) or in group (II). The work done by Cheng and Lin [3] falls in the group marked (I) in Table 1. Specifically, they propose a multi-phase data reordering and a graph-based address register allocation for array-based DSP codes. Lee and Park [6], on the other hand, study a code restructuring technique for optimizing AGU usage across loop iterations. References [2] and [15] are examples of prior studies that focused on memory layout for scalar codes (marked as (III) in Table 1). The studies in this group mainly try to determine an optimal memory layout for program variables based on a given variable access sequence. References [11] and [15] focus on scalar codes. Another possible approach to solving this problem is based on the simplification of the access graph [16] so that more exhaustive solution methods can be deployed. Code restructuring has been used in the group of studies marked as (IV) in Table 1 in order to increase the benefits of AGU utilization. Code restructuring can be used with or without memory layout optimization. To our knowledge, the work presented in this paper (which falls in category (III) in Table 1) is the first one that studies an ILP-based approach to obtain an optimal memory layout for program variables given a variable access sequence.

3. ADDRESS CODE GENERATION

Fig. 1 (from [7]) depicts a typical AGU (address generation unit) employed in current DSPs. Several current architectures such as Texas Instruments' TMS320C2X/5X [18], Analog Devices' TigerSHARC [19] (which is a multi-core architecture), StarCore SC1200 [21], and Motorola's DSP56K [20] have similar AGUs. In this unit, there are two register files, one for address registers and the other for modify registers. Note that an effective memory address can be constructed either by auto-incrementing or auto-decrementing an address register, or by adding the contents of an address register and the contents of a modify register. There are also instructions that load new values to these address registers and modify their contents. The important point is that the auto-increment/auto-decrement operations can fit in an instruction format that contains an ALU operation. Not all AGUs employ a modify register set. However, when they do, the address calculations that involve modify registers can also co-exist in the same instruction that specifies a normal ALU operation. That is, the same instruction can indicate a normal ALU operation plus an AGU operation. Note that, if we do not exploit auto-increment/auto-decrement modes and do not use modify registers, the compiler needs to issue "explicit instructions" that modify (load) the contents of the address register before the next memory access can be made using that address register. This typically requires an extra execution cycle at runtime. Therefore, effective use of an AGU can reduce both code size and overall execution cycles (our focus in this work is mainly on the code size issue). If not optimized, instructions that manipulate address registers can account for over 50% of total program bits [14].

Figure 1: A typical address generation unit (AGU) employed in current DSPs.

Let us now consider an example scenario that demonstrates how variable layout in memory space affects the number of auto-increment/auto-decrement modes used (and, of course, the number of explicit load operations on the address registers). Assume that there are four program variables in the program, and these variables are (in the order of declaration in the program code) A, B, C, and D. If the variable access sequence (the order in which the program variables are accessed) we are considering is A-C-A-C-B-D-C-A, then the declarative storage order (i.e., storing variables in memory based on their declaration order in the program text) will only be able to use the auto-increment/auto-decrement modes twice, resulting in five additional load instructions. On the left part of Fig. 2, the declarative order memory layout is given. On the right-hand side of the same figure is the optimum memory layout. If the variables are stored in the order of A, C, B, and D (instead of the declarative order shown on the left), then accessing the same sequence would incur a single additional instruction (when accessing C after D). The access sequences and the explicit address loads on these sequences are also shown in the same figure. In the base case, only one bit is used for address registers to capture the auto-increment/auto-decrement range. It is also possible to employ more bits to define an auto-increment/auto-decrement range (this depends on the architecture under consideration). That is, while in the simple case only one bit is used that has a range of $(-1, 1)$, by using $B$ bits it is possible to define an auto-increment/auto-decrement range of $(-2^{B-1}, 2^{B-1})$.

Figure 2: Memory layout for variables used in a given access sequence for declarative order (left) and for optimum order (right).
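The load counts quoted for the four-variable example (access sequence A-C-A-C-B-D-C-A) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_explicit_loads` and the parameter `reach` are our own, `reach = 1` models the one-bit base case, and the initial load that points the address register at the first variable is assumed not to be counted.

```python
def count_explicit_loads(layout, sequence, reach=1):
    """Count accesses that need an explicit address-register load.

    layout   -- variables in memory order (index = address offset)
    sequence -- variable access sequence
    reach    -- largest step covered by auto-increment/auto-decrement
                (assumed to be 1 in the one-bit base case)
    The initial load for the first access is not counted (assumption).
    """
    address = {var: pos for pos, var in enumerate(layout)}
    loads = 0
    for prev, cur in zip(sequence, sequence[1:]):
        # A step larger than the reach cannot be folded into the
        # previous instruction, so an explicit load is required.
        if abs(address[cur] - address[prev]) > reach:
            loads += 1
    return loads


sequence = "ACACBDCA"
declarative = count_explicit_loads("ABCD", sequence)  # declaration order
optimum = count_explicit_loads("ACBD", sequence)      # layout A, C, B, D
print(declarative, optimum)
```

Under these assumptions the declarative layout needs five explicit loads and the layout A, C, B, D needs one (for the access to C after D), in line with the counts given in the text.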

Similarly, modify registers help reduce the number of explicit address loads. By adding/subtracting the contents of a modify register to/from the address register, it is possible to access the next variable in the access sequence. This provides the possibility of exploiting the available address registers in a more flexible way. For example, a modify register can be used to reach a program variable from another one when this requires a longer jump than the auto-increment/auto-decrement range of an address register can cover. Fig. 3 illustrates how modify register optimization can be useful for a given access sequence. In this example, there are seven program variables, namely, A, B, C, D, E, F, and G. The program access sequence under consideration is also shown in the figure. Although the approach can be extended to a larger number of registers, for this example we assume that there is only one address register and one modify register. On the left side of this figure, one of the optimum memory layouts under an address register optimization is given. On the right-hand side is the version that makes use of the modify register. Instructions required to access the variables in the sequence are also given. As can be seen from this figure, the modify register version (right) uses the modify register in address computations, which in turn reduces the total number of instructions. Specifically, the number of instructions for this access sequence is reduced from 5 to 3 by using a modify register along with the address register. This small example shows how modify registers can help in further reducing the number of explicit address loads.
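The effect of a modify register can be sketched in the same style. The cost model below is a simplifying assumption of ours, not the paper's formulation: there is one address register and one modify register, the modify register is loaded once with a fixed value before the sequence starts (that setup is not counted), and a step is free if it is within the auto-increment/auto-decrement reach or equal in magnitude to the modify register value. The layout and access sequence are hypothetical and are not the ones shown in Fig. 3.

```python
def explicit_loads(layout, sequence, reach=1, modify=None):
    """Count explicit address loads with an optional modify register.

    modify -- fixed value held in the modify register, or None if no
              modify register is used (hypothetical cost model).
    """
    address = {var: pos for pos, var in enumerate(layout)}
    loads = 0
    for prev, cur in zip(sequence, sequence[1:]):
        step = abs(address[cur] - address[prev])
        covered = step <= reach or (modify is not None and step == modify)
        if not covered:
            loads += 1
    return loads


def best_modify_value(layout, sequence, reach=1):
    """Pick the modify register value that minimizes explicit loads."""
    candidates = range(reach + 1, len(layout))
    return min(
        candidates,
        key=lambda m: explicit_loads(layout, sequence, reach, m),
        default=None,
    )


layout = "ABCDEFG"      # hypothetical layout of seven variables
sequence = "ADGDABC"    # hypothetical access sequence
without_mr = explicit_loads(layout, sequence)
best = best_modify_value(layout, sequence)
with_mr = explicit_loads(layout, sequence, modify=best)
print(without_mr, best, with_mr)
```

In this made-up sequence the jumps of length three dominate, so holding 3 in the modify register removes every explicit load that the address register alone would need.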

Figure 3: Memory layout for variables used in a given access sequence for optimum address register order (left) and for optimum modify register order (right) (AR = address register and MR = modify register). Activities given in red color (light) indicate an extra instruction.

4. ILP FORMULATION

We use 0-1 ILP in this paper for determining the memory layout of program variables and the address/modify register assignment to these program variables. In our formulation below, $N$, $M$, $R_a$, $R_m$, and $B$ represent, respectively, the length of the variable access sequence to be optimized, the number of program variables, the number of address registers, the number of modify registers, and the number of bits used to define the auto-increment/auto-decrement range. Also, Sequence holds the variable access sequence to be optimized (this is extracted by the compiler automatically from the program text). We used Xpress-MP [17] to formulate and solve our ILP problem.

4.1 Simple Offset Assignment

In the simple offset assignment problem, it is assumed that only one address register is available (i.e., $R_a = 1$). Our objective is to minimize the number of explicit loads to this address register, given a variable access sequence. Our approach tries to achieve this by determining the optimal memory layout for program variables. We assume that each program variable is assigned to only one memory location and that a memory location can hold a single variable throughout the execution of the application. We define 0-1 variables for each variable and for each memory location. By using these 0-1 variables, we determine the memory locations at which the program variables are to be stored. We use the term 'program point' to denote the steps of a given access sequence. Assuming that $N$ denotes the length of the access sequence and $M$ the number of variables in the application program to be optimized, our approach uses 0-1 variables to specify the potential memory location ($L$) of each variable. Specifically, we have:

• $L_{v,l}$: indicates whether variable $v$ is assigned to memory location $l$.

We use $P$ to identify the variable currently pointed to by the address register. More specifically,

• $P_{v,s}$: indicates whether the address register points to variable $v$ at program point $s$.

$C$ is used to identify whether there is an explicit load to the address register at a program point or not.

• $C_s$: indicates whether there is an explicit load to the address register at program point $s$.

In order to identify whether a variable is in the auto-increment/auto-decrement range of another variable, we use variable $R$. That is,

• $R_{v_1,v_2}$: indicates whether $v_1$ and $v_2$ are in the auto-increment/auto-decrement range (to be explained shortly) of each other.

Since a variable can be placed into a single memory location, the following constraint must be satisfied:

$$\sum_{i=1}^{M} L_{v,i} = 1, \quad \forall v. \qquad (1)$$

In the above expression, index variable $i$ iterates over all possible memory locations. We assume there are $M$ locations, that is, as many as the number of program variables (i.e., we do not allow any gap between the memory locations of two neighboring variables). This implies a one-to-one mapping between the memory locations and the variables. This assumption will be relaxed later in the paper to have a wider address space than the number of variables. To ensure that only one variable is assigned to each memory location, we use the following expression:

$$\sum_{i=1}^{M} L_{i,l} = 1, \quad \forall l. \qquad (2)$$
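Constraints (1) and (2) together state that the 0-1 matrix L is a permutation matrix. For very small instances, the layout an ILP solver would select can be cross-checked by plain enumeration. The sketch below is not the Xpress-MP model used in this work; it is an exhaustive search written for illustration, with hypothetical helper names, an assumed auto-increment/auto-decrement reach of 1, and the initial address load left uncounted.

```python
from itertools import permutations


def assignment_matrix(layout, variables):
    """L[v][l] = 1 iff variable v is stored at memory location l."""
    return [[1 if layout[l] == v else 0 for l in range(len(layout))]
            for v in variables]


def satisfies_constraints(L):
    """Check constraint (1) on every row and constraint (2) on every column."""
    rows_ok = all(sum(row) == 1 for row in L)        # one location per variable
    cols_ok = all(sum(col) == 1 for col in zip(*L))  # one variable per location
    return rows_ok and cols_ok


def explicit_loads(layout, sequence, reach=1):
    """Number of accesses whose step exceeds the auto-modify reach."""
    address = {var: pos for pos, var in enumerate(layout)}
    return sum(1 for prev, cur in zip(sequence, sequence[1:])
               if abs(address[cur] - address[prev]) > reach)


def best_layout(variables, sequence, reach=1):
    """Enumerate all feasible layouts and keep the first cheapest one."""
    best = None
    for layout in permutations(variables):
        L = assignment_matrix(layout, variables)
        assert satisfies_constraints(L)  # every permutation is feasible
        cost = explicit_loads(layout, sequence, reach)
        if best is None or cost < best[0]:
            best = (cost, layout)
    return best


best_cost, layout = best_layout("ABCD", "ACACBDCA")
print(best_cost, "".join(layout))
```

Enumeration grows factorially with the number of variables, so it is useful only as a sanity check on toy inputs such as the four-variable example of Section 3.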

In addition, since the address register can store only one address at a time, the following constraint has to be satisfied during the course of execution at every program point:

$$\sum_{i=1}^{M} P_{i,s} = 1, \quad \forall s. \qquad (3)$$

In this expression, we assume that initially the address register points to NULL. Therefore, $P_{v,0} = 0, \forall v$. If two program variable addresses are within a distance of $2^{B-1}$, then these two variables can be accessed via auto-increment/auto-decrement mode. We express this as follows:

$R_{v_1,v_2} \geq L_{v_1,a} + L_{v_2,b} - 1, \quad \forall v_1, v_2, a, b,$

|a − b|