Minimizing Register Requirements of a Modulo Schedule via Optimum Stage Scheduling Alexandre E. Eichenberger and Edward S. Davidson

Santosh G. Abraham

Advanced Computer Architecture Laboratory EECS Department, University of Michigan Ann Arbor, MI 48109-2122

Hewlett Packard Laboratories 1501 Page Mill Road Palo Alto, CA 94304


Abstract

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present an approach that schedules the loop operations for minimum register requirements, given a modulo reservation table. Our method determines optimal register requirements for machines with finite resources and for general dependence graphs. Measurements on a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that the register requirements decrease by 24.5% on average when applying the optimal stage scheduler to the MRT-schedules of a register-insensitive modulo scheduler.

Keywords: Register-sensitive modulo scheduling, software pipelining, instruction level parallelism, VLIW, superscalar.

1 Introduction

Software pipelining is a technique for exploiting the instruction level parallelism present among the iterations of a loop by overlapping the execution of several loop iterations. With sufficient overlap, some functional unit can be fully utilized, resulting in a schedule with maximum throughput. Modulo scheduling [1][2][3] restricts the scheduling space by using the same schedule for every iteration of a loop and by initiating successive iterations at a constant rate, i.e. one initiation interval (II clock cycles) apart. This restriction, known as the modulo scheduling constraint, requires that all usages of any particular resource by a single iteration must be scheduled at distinct times modulo II.

The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [4] or IF-conversion [5][6]. Loops with conditional exits can also be modulo scheduled [7]. Furthermore, the code expansion due to modulo scheduling can be eliminated when using special hardware such as rotating register files and predicated execution [8].

As modulo scheduling achieves higher throughput by overlapping the execution of several iterations, it results in higher register requirements. In fact, register requirements generally increase as concurrency increases, whether due to exploiting machines with faster clocks and deeper pipelines, wider issue, or a combination of both [9]. As a result, a scheduling algorithm that reduces the register pressure while scheduling for high throughput is increasingly important.

In this paper, we treat modulo scheduling as a three step procedure. Some code generation strategies treat each step separately; e.g. Eisenbeis's software pipelining approach [10] uses distinct heuristics for Steps 1, 2, and 3 below. Other code generation strategies combine steps to simultaneously meet distinct objectives; e.g. Huff's lifetime-sensitive modulo scheduler [11] combines Steps 1 and 2 below.

To appear in the International Journal of Parallel Programming, February, 1996.

1. MRT-scheduling primarily addresses resource constraints and is best implemented by using a modulo reservation table (MRT) [2][12], which contains II rows, one per clock cycle, and one column for each resource at which conflicts may occur. Filling the MRT consists of packing the operations of one iteration within the II rows of the MRT to obtain a schedule with no resource conflicts. This step determines the distance modulo II between any two operations. As steady-state throughput is simply the reciprocal of II, choosing the minimum feasible II achieves the highest possible steady-state performance.

2. Stage-scheduling primarily addresses dependence constraints, which specify that the distance between each pair of dependent operations is no less than the latency for such a pair. These constraints are satisfied by delaying each operation by some (integer) multiple of II. Note that there may not be any feasible stage schedule for a given MRT-schedule because of some critical recurrence cycles in the dependence graph. In this case, Step 1 must generate a new MRT-schedule, possibly with a larger initiation interval.

3. Register allocation performs the actual allocation of virtual registers to physical registers, a scheme that may vary depending on the hardware support available and the desired level of loop unrolling and loop peeling [4][8][13].

In this paper we present a method that performs the second step in an optimal fashion for loop iterations that consist of a single basic block with a general dependence graph. For IF-converted basic blocks, we assume that all predicates are set to true when performing Step 2. Our method satisfies the dependence constraints while simultaneously minimizing MaxLive [8][11], the minimum number of registers required to generate spill-free code. MaxLive corresponds to the maximum number of live values at any single cycle of the loop schedule.
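The modulo scheduling constraint of Step 1 can be illustrated with a minimal sketch (ours, not the paper's implementation; operation and resource names are made up): an MRT with II rows rejects any placement whose row, i.e. time mod II, is already occupied for that resource.

```python
class MRT:
    """A modulo reservation table: II rows, one column per resource
    at which conflicts may occur."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.table = {r: [None] * ii for r in resources}

    def place(self, op, time, resource):
        # Only time mod II matters: all uses of a resource by one
        # iteration must fall at distinct times modulo II.
        row = time % self.ii
        if self.table[resource][row] is not None:
            return False  # slot already taken: resource conflict
        self.table[resource][row] = op
        return True

mrt = MRT(ii=2, resources=["ld/st", "add/sub", "mult/div"])
assert mrt.place("load", 0, "ld/st")
assert mrt.place("store", 1, "ld/st")
assert not mrt.place("load2", 2, "ld/st")  # 2 mod 2 == 0: conflicts with load
```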
We present two algorithms for stage-scheduling, both of which schedule for machines with finite resources. The first, linear-time algorithm can handle loops whose dependence graphs have acyclic underlying graphs (the underlying graph is the undirected graph formed by ignoring the direction of each arc). The second, general algorithm uses a linear programming approach and can handle general dependence graphs with unrestricted loop-carried dependences and common sub-expressions. We can also quickly determine when the faster first method is applicable.

We investigate the performance of the optimum stage scheduler and other schedulers for a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels for a machine with complex resource usage, the Cydra 5 [14]. Our empirical findings show that the register requirements decrease by 24.5% on average in this benchmark suite when applying the optimal stage scheduler to the MRT-schedules of a register-insensitive modulo scheduler. Though the general algorithm presented in this paper may be too slow to be integrated in a production compiler, the decrease in register requirements motivates the development of better heuristics for combined scheduling and register allocation for modulo-scheduled loops. For example, we used the optimal stage scheduler as a guide to develop efficient stage scheduling heuristics [15]. This method is thus useful in assessing the effectiveness of register-sensitive stage scheduling and modulo scheduling heuristics.

In this paper, we present related work in Section 2. We discuss the impact of stage-scheduling on register requirements in Section 3, develop a linear programming model that minimizes integral MaxLive for a given MRT in Section 4, and a method that minimizes total MaxLive in Section 5. We compare the optimum stage schedules with the results of other schedulers on a benchmark suite of 1327 loops in Section 6 and present conclusions in Section 7.

2 Related Work

Developing modulo scheduling techniques that exploit instruction level parallelism while containing the register requirements is crucial to the performance of future machines. A significant body of research has sought effective register-sensitive scheduling algorithms: optimal solutions have been researched under various simplifying restrictions, and efficient heuristics have been investigated. The algorithms of this paper precisely handle the register requirements for machines with finite resources and find optimum (minimum register) solutions for the restricted problem of stage scheduling a given MRT.

A precise modeling of the register requirements of modulo schedules was first proposed by Mangione-Smith et al. for loop iterations where each virtual register is used at most once, i.e. for forest-of-trees dependence graphs [9]. They also presented a linear-time stage scheduler resulting in minimum MaxLive for loop iterations with dependence graphs that are forests of trees. The work presented here extends their results in two directions. First, our linear-time stage scheduler permits multiple uses of each virtual register, provided that the underlying dependence graph is acyclic. Second, our general stage scheduler handles arbitrary dependence graphs, including loop-carried dependences and common sub-expressions.

Searching for modulo schedules with the highest steady-state throughput over the entire space of modulo schedules, jointly solving Steps 1 and 2, was first formulated by Hsu for machines with finite resources and for loop iterations with arbitrary dependence graphs, a formulation that satisfies scheduling dependences but does not attempt to minimize the register requirements [2]. Ning and Gao proposed a polynomial-time algorithm that results in a schedule with the highest steady-state throughput over all modulo schedules, and minimum buffer requirements among such schedules [16]. In their work, and elsewhere [17][18][19], the register requirements of a schedule are approximated by conceptual FIFO buffers that are reserved for an interval of time that is a multiple of II. Their scheduling algorithm handles loop iterations with arbitrary dependence graphs. Ning and Gao's results were extended to machines with finite resources by Govindarajan et al. [17].

Recently, we contributed a scheduling algorithm that jointly addresses Steps 1 and 2 by combining the precise modeling of the register requirements with the complete description of the modulo scheduling space [20]. In that work, we find a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among such schedules. That work handles machines with arbitrary resource requirements and loop iterations with arbitrary dependence graphs. By considering all MRTs for a given II, that algorithm generally results in schedules with lower register requirements than the optimal stage scheduling algorithm for a given MRT proposed in this paper; however, that algorithm is extremely computationally intensive. As a result, only small loops were handled using that approach, i.e. loop iterations with up to 12 operations and II up to 5 [20]. In this paper our benchmark suite contains loop iterations with up to 161 operations and II up to 165.

Another body of research has sought efficient register-sensitive scheduling heuristics. For example, Huff investigated a heuristic based on a bidirectional slack-scheduling method that schedules operations early or late depending on their number of stretchable input and output flow dependences [11]. Llosa et al. have recently proposed a heuristic that is based on a bidirectional slack-scheduling method with a scheduling priority function tailored to minimize the register requirements [21]. Based on the optimal stage scheduler presented in this paper, we have also investigated fast stage scheduling heuristics [15]. Our best linear-time stage scheduling heuristic decreases the register requirements by 24.4% over a register-insensitive scheduler with the same MRT, achieving on average 99.7% of the decrease in register requirements obtained by an optimal stage scheduler on the benchmark suite used in this paper.

Register allocation algorithms have been investigated by Rau et al. [8]. Their algorithm achieves register allocations that are within one register of the MaxLive lower bound for the vast majority of their modulo-scheduled loops on machines with rotating register files and predicated execution to support modulo scheduling. By using loop unrolling and consequent code expansion, their algorithm typically achieves register allocations that are within four registers of the MaxLive lower bound on machines without such hardware support.

3 Precise Register Requirements of a Modulo Schedule

In this section, we present the precise register requirements of a modulo schedule and illustrate the impact of stage scheduling on the register requirements with a few examples. The example target machine is a hypothetical processor with three independent and fully pipelined functional units: load/store, add/sub, and mult/div. The memory latency and the add/sub latency are one cycle, the mult latency is four, and the div latency is eight. We selected these values to obtain concise examples; however, our method works independently of the number of functional units, resource constraints, and latencies.

Example 1 Our first example uses a simple kernel to illustrate how to compute the register requirements of a modulo-scheduled loop. This kernel is y[i] = x[i]^2 + a, where the value of x[i] is read from memory,

[Figure 1 panels: a) Dependence graph; b) MRT (II = 2); c) Replicated MRT showing the schedule of one iteration; d) Lifetimes of vr0, vr1, vr2 (latency portion and additional lifetime); e) Register requirements (integral and fractional parts).]
Figure 1: MRT and stage schedules (Example 1).

squared, incremented by a constant, and stored in y[i], as shown in the dependence graph of Figure 1a. The vertices of the dependence graph correspond to operations and the edges correspond to virtual registers. The value of each virtual register is defined by a unique operation; once its value has been defined, it may be used by several operations. In this paper, a virtual register is reserved in the cycle where its define operation is scheduled, and remains reserved until it becomes free in the cycle following its last-use operation. Thus in our machine model, the additional reserved time beyond the beginning of the last-use operation cycle is one clock cycle. Although the machine model has an impact on the formulation of the register requirements, the technique presented in this paper can be directly adapted to other models as well. Other machines, for example, may have an additional reserved time which is 0, 2, or operation dependent. The lifetime of a virtual register is the set of cycles during which it is reserved.

The MRT-scheduler places each operation of an iteration somewhere within the modulo reservation table (MRT) such that no resource is used more than once at times that are congruent modulo II, i.e. there is at most one entry in each MRT cell. Figure 1b illustrates an MRT-schedule with II = 2 for the kernel of Example 1 on the target machine. In this schedule, the load, add, and mult operations are scheduled in the first row and the store operation is scheduled in the second row of the MRT. Once the MRT-schedule is completed, the stage-scheduler delays each operation by some integer multiple of II cycles so that it is scheduled only after its input values have been calculated and made available to it by its predecessor operations.
By delaying operations only by multiples of II, the row in which each operation is placed is unaltered; thus, the resource constraints are guaranteed to be fulfilled for any stage schedule associated with a valid MRT-schedule. A stage-schedule for Example 1 is shown in Figure 1c. This figure is an execution trace; the MRT is replicated sufficiently to show one complete iteration. Each replication is referred to as a stage [8] of the software pipelined loop. The circles highlight the operations that belong to the iteration initiated at time 0. The circles must be placed so as to satisfy the specified latencies. The virtual register lifetimes for this iteration are presented in Figure 1d. The black bars correspond to the initial portion of the lifetime which minimally satisfies the latency of the defining functional unit; the grey bars show the additional lifetime of each virtual register. Ideally, the additional lifetimes should not be longer than one cycle; however, because of resource or dependence constraints, it is not always possible to schedule each use operation immediately after its input operand latencies expire. The dependence constraints are satisfied if and only if no circled operation is placed during the portion of the lifetime that overlaps with the black bar of a virtual register that it uses. For example, the add

operation uses vr1 and thus cannot be scheduled at time 2 or 4, but can be scheduled at time 6. Since the add operation is placed in row 0 of the MRT, only the rows in the replicated MRT that are congruent to 0 (mod II) are considered. The add will thus be delayed by an integer multiple of II, 2 · II in this example.

Traditionally, a schedule is represented by associating each operation with a schedule time; in this paper, however, a schedule is represented by associating each dependence edge with a schedule time interval, referred to as the skip factor. This novel representation is critical for expressing the stage scheduling problem efficiently. Consider a dependence edge from operation i to operation j, scheduled at time_i and time_j, respectively. We define the skip factor along edge (i, j) as the number of times the row of operation j is encountered in the execution trace during the time interval [time_i, time_j). In Figure 1c, for example, the skip factor along the dependence edge (mult, add) is 2, as the row of the add operation is encountered twice in the interval [2, 6), once at time 2 and once at time 4. The skip factor is within 1 of the stage difference used in [9][16], and results in a simpler, but equivalent, formulation.
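The skip-factor definition can be sketched directly (our sketch, not the paper's code): count how often the row of operation j appears in the half-open interval [time_i, time_j).

```python
def skip_factor(time_i, time_j, row_j, ii):
    """Number of times row row_j is encountered in [time_i, time_j)."""
    return sum(1 for t in range(time_i, time_j) if t % ii == row_j)

# Figure 1c: mult at time 2, add at time 6, add in row 0, II = 2.
assert skip_factor(2, 6, row_j=0, ii=2) == 2  # row 0 occurs at times 2 and 4
```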

Formal Problem Definition: Among all the stage schedules that employ a given MRT, find one (i.e. assign the skip factors) that satisfies all the dependence constraints and minimizes MaxLive, the maximum number of live values at any single cycle of the loop schedule.

Stage scheduling is performed by assigning an integer value to each skip factor, i.e. for each dependence edge, so as to generate a valid stage schedule that minimizes MaxLive. Because of the modulo constraint, MaxLive can be quickly computed, as illustrated in Figure 1e, by wrapping the lifetimes for the first iteration around a vector of length II. In particular, Figure 1e shows the number of live registers in steady-state loop execution and is constructed by replicating Figure 1d with a shift of II cycles between successive copies until the pattern in II successive rows repeats indefinitely. Figure 1e then displays these II steady-state rows in a compact form. In Figure 1e, we see that exactly six virtual registers are live in the first row and four in the second, resulting in a MaxLive of six.

Figure 1e distinguishes between two distinct contributions to each virtual register lifetime: the integral and the fractional part. For virtual registers with several uses, only the last use and its skip factor, s, are considered. The integral part spans the entire MRT exactly s times, as shown in Figure 1e. The fractional part spans the rows in the MRT inclusively from the def operation row forward with wraparound through the last-use operation row. The length of the fractional part thus ranges from 1 to II and is equal to 1 plus the modulo II distance between these rows, as shown in Figure 1e. The integral MaxLive is defined as the number of integral part bars in a row, summed over all virtual registers. Similarly, the fractional MaxLive is the number of fractional part bars in the row with the most fractional part bars. The total MaxLive is the sum of the fractional and integral MaxLive. The stage-schedule presented in Figure 1 results in the minimum register requirements for this kernel, MRT, and set of functional unit latencies.
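Wrapping lifetimes around a vector of length II can be sketched as follows (our sketch; the lifetime spans for vr0-vr2 are our reading of Figure 1d, and the whole/partial split below differs slightly from the paper's integral/fractional bookkeeping but yields the same total MaxLive).

```python
def max_live(ii, lifetimes):
    """Total MaxLive: wrap each lifetime (start, end) around a vector
    of length II and take the row with the most live values. A value is
    reserved from cycle `start` up to, but not including, cycle `end`."""
    live = [0] * ii
    for start, end in lifetimes:
        whole, rest = divmod(end - start, ii)
        for row in range(ii):                 # complete MRT spans hit every row
            live[row] += whole
        for t in range(start, start + rest):  # remaining partial cycles
            live[t % ii] += 1
    return max(live)

# Example 1 (II = 2): vr0 live [0, 3), vr1 live [2, 7), vr2 live [6, 8).
assert max_live(2, [(0, 3), (2, 7), (6, 8)]) == 6  # six live in row 0, four in row 1
```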
Mangione-Smith et al [9] have shown that for dependence graphs with at most a single use per virtual register, as in Example 1, minimizing each skip factor individually always results in the minimum register requirements. However, this result does not apply to general dependence graphs with unrestricted common sub-expressions and loop-carried dependences. Our next example illustrates such a dependence graph in which minimizing each skip factor individually does not result in the minimum register requirements.

Example 2 Our second kernel is y[i] = x[i]^2 − x[i] − a, where the value of x[i] is read from memory, squared, decremented by x[i]+a, and stored in y[i], as shown in the dependence graph of Figure 2a. Figure 2b illustrates an MRT-schedule for the kernel of Example 2, and Figure 2c presents an execution trace in which each operation is scheduled in its earliest stage. Although the MRT-schedule shown in Figure 2b is constructed to illustrate our point concisely, experimental evidence indicates that similar situations occur in our benchmark suite.

Notice the two distinct paths between the load and sub operations in the dependence graph. Because we represent a stage schedule by associating time intervals with dependence edges, we must ensure that the sum of the time intervals along the path load-mult-sub is equal to the sum along the path load-add-sub. In general, there are distinct paths among operations if there is a cycle in the underlying dependence graph, i.e. if there is a cycle in the dependence graph when the directions of the arcs are ignored. In the presence of a cycle in the underlying graph, we must ensure that the cumulative time intervals on a path that traverses

[Figure 2 panels: a) Dependence graph; b) MRT (II = 2); c) Replicated MRT showing the schedule of one iteration; d) Lifetimes of vr0-vr3 (latency portion and additional lifetime); e) Register requirements (integral and fractional parts).]
Figure 2: MRT and stage schedules with add scheduled early (Example 2).

[Figure 3 panels: a) Dependence graph; b) MRT (II = 2); c) Replicated MRT showing the schedule of one iteration; d) Lifetimes of vr0-vr3 (latency portion and additional lifetime); e) Register requirements (integral and fractional parts).]

Figure 3: MRT and stage schedules with add scheduled late (Example 2).


the cycle is zero. The cumulative time interval of a path corresponds to the algebraic sum of the time intervals associated with the edges along that path, where a time interval is taken as negative when the edge is traversed in the reverse direction. Subsequently, we refer to the elementary cycles in an underlying dependence graph as underlying-cycles and to the new constraints associated with these cycles as underlying-cycle constraints. We present in detail how to derive the underlying-cycle constraints in Section 4.

In Figure 2, the add operation is scheduled in the first stage (at time 1). However, we can also schedule the add later (in stage 2 or 3, namely at time 3 or 5) without delaying the sub operation, since the result of this operation must be kept alive through time 6 in any case. In Figure 3, the add operation is scheduled at time 5. By comparing Figures 2e and 3e, we see that scheduling the add operation later increases the lifetime of vr0 by 1 1/2 columns, but decreases the lifetime of vr2 by 2 columns. As a result, MaxLive is reduced from 9 to 8 by scheduling the add operation late. This result is surprising, since we might expect the lifetime increase of vr0 to match the 2-column lifetime decrease of vr2. However, with an early add, the mult is the last use of vr0 and the first cycle of additional add delay (row 2) does not increase the vr0 lifetime.

Example 2 shows that scheduling each operation as early as possible does not necessarily result in minimum register requirements for dependence graphs that have cycles in their underlying graphs, and that a more global stage-scheduling algorithm is needed when underlying cycles are present. To summarize the findings of this section, we have seen that stage scheduling can impact the register requirements for a given kernel, MRT, and set of functional unit latencies. A stage scheduler must enforce the dependence constraints, scheduling each operation only after all its input operands are available.
In the presence of cycles in the underlying dependence graph, the stage scheduler must also ensure that the cumulative distance along a path that traverses the edges of each underlying-cycle is zero. Since underlying-cycles can interact with one another, minimizing the register requirements can only be achieved in general by reconciling these interactions globally for general dependence graphs.

4 Stage Scheduling for Minimum Integral MaxLive

In this section, we present an algorithm that finds a stage schedule resulting in the minimum integral MaxLive for a given kernel, MRT, and set of functional unit latencies. We introduce the variables that characterize a stage schedule and present the conditions that define a valid schedule. All variables presented in this section are summarized in Table 1. We omit here the fractional MaxLive because its behavior is highly non-linear. This omission results in stage-schedules that require no more than one additional register per virtual register that is used multiple times, including at least one use by an operation in an underlying-cycle. However, our final algorithm in Section 5 will take both the integral and the fractional MaxLive into account.

We represent a loop by a dependence graph G = {V, E_sched, E_reg}, where the set of vertices V represents operations and the sets of edges E_sched and E_reg correspond, respectively, to the scheduling dependences and the register dependences among operations. A scheduling edge enforces a temporal relationship between a pair of dependent operations or between a pair that cannot be freely reordered, such as load and store operations to ambiguous (or the same) memory locations. A scheduling edge from operation i to operation j, w iterations later, is associated with a latency l_{i,j} and a loop-carried dependence distance ω_{i,j} = w. A register edge corresponds to a data flow dependence carried in a register. In this paper, each register edge is also a scheduling edge. However, there are scheduling edges that have no corresponding register edge, e.g. the dependence between a store and a load to an ambiguous memory location results in a scheduling edge but not in a register edge. Techniques that remove edges in the dependence graph may result in register edges without corresponding scheduling edges, as shown in [20]. These techniques can be applied to the model presented in this paper as well.

We characterize the initial MRT-schedule by its initiation interval II and by the row of the MRT in which each operation is placed. We refer to the row in which operation i is placed as row_i. We also define the following distance relation between operations i and j: rdist_{i,j}, the distance from the row of operation i to the next row of operation j, possibly in the next instance of the MRT,

which may be computed as follows:

    rdist_{i,j} = dist(row_i, row_j) = (row_j − row_i) mod II    (1)

Note that since the MRT is given and unchanged by the stage scheduler, row_i and rdist_{i,j} are simply input constants. Using these terms, we may define the time interval along an edge from operation i to operation j (i.e. time_j − time_i) as:

    rdist_{i,j} + skip_{i,j} · II    (2)

where the first and second term represent, respectively, the fractional part and the integral part of the schedule distance along edge (i, j). In this paper, we distinguish between two components of skip_{i,j}: s_{i,j}, the minimum (integer) skip factor necessary to hide the latency of the scheduling edge, and p_{i,j}, an additional (non-negative integer) skip factor used to postpone operation j further. The s_{i,j} are input constants, evaluated as in Equation (4) below, whereas the p_{i,j} are the fundamental variables determined by the stage scheduler.

Consider a scheduling edge between operation i and a dependent operation j, ω_{i,j} iterations later. Using Equation (2), we may write the dependence constraints, i.e. (time_j + ω_{i,j} · II) − time_i ≥ l_{i,j}, as:

    rdist_{i,j} + (ω_{i,j} + s_{i,j} + p_{i,j}) · II ≥ l_{i,j}    (3)

Equation (3) states that the distance from the row of operation i to the next row of operation j, increased by skipping ω_{i,j} + s_{i,j} + p_{i,j} entire instances of the MRT, must be no smaller than the latency l_{i,j} of the scheduling edge (i, j). Since we are interested in the smallest integer value of s_{i,j} that satisfies Equation (3), regardless of the non-negative value of p_{i,j}, we obtain the following value for the minimum skip factor s_{i,j}:

    s_{i,j} = ⌈(l_{i,j} − rdist_{i,j}) / II⌉ − ω_{i,j}    (4)

Consequently, a scheduling edge (i, j) is satisfied when operation j is skipped at least s_{i,j} times after operation i before being scheduled. If ω_{i,j} is sufficiently large, s_{i,j} may be negative.

Finally, we need to guarantee that all underlying-cycle constraints are fulfilled. Consider a directed closed path in the dependence graph that traverses some underlying-cycle u. Define sign_{i,j}(u) to be +1 if the closed path traverses edge (i, j) in the forward direction, and −1 if the closed path traverses edge (i, j) in the reverse direction. Sign_{i,j}(u) is undefined for edges (i, j) that do not belong to the underlying-cycle. Using Equation (2), the cumulative distance of an underlying-cycle u is thus:

    Σ_{(i,j) ∈ u} sign_{i,j}(u) { rdist_{i,j} + (s_{i,j} + p_{i,j}) · II }    (5)

By setting this cumulative distance to zero we obtain the following constraint:

    Σ_{(i,j) ∈ u} sign_{i,j}(u) p_{i,j} = − Σ_{(i,j) ∈ u} sign_{i,j}(u) s_{i,j} − φ_u,    ∀ u ∈ U_G    (6)

where φ_u is:

    φ_u = (1 / II) Σ_{(i,j) ∈ u} sign_{i,j}(u) rdist_{i,j}    (7)

for each underlying cycle u in the set U_G of underlying cycles in the underlying graph of the dependence graph G = {V, E_sched}. The first noticeable fact is that all but the p_{i,j} variables are fully defined by the input parameters to the stage scheduler. We are therefore free to set the p_{i,j} variables to any non-negative integer value, as long as they satisfy Equation (6) for each underlying cycle. The second important fact is that Equation (6) has a solution only if φ_u is itself integer valued, since all the s_{i,j} and p_{i,j} are integer valued. This is fortunately always the case, since the cumulative distance from an operation to any instance of itself is by definition a multiple of II.
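A small numeric sketch of Equations (2)-(4) (our own, with hypothetical edge values): for an edge whose endpoints share an MRT row (rdist = 0), with latency 4, no loop-carried distance, and II = 2, the minimum skip factor is 2, and no smaller integer skip satisfies the dependence constraint of Equation (3).

```python
import math

II = 2

def interval(rdist, skip):
    """Equation (2): time_j - time_i along an edge."""
    return rdist + skip * II

def min_skip(latency, rdist, omega):
    """Equation (4): smallest integer s satisfying Equation (3)
    for any non-negative p."""
    return math.ceil((latency - rdist) / II) - omega

s = min_skip(latency=4, rdist=0, omega=0)
assert s == 2
assert interval(0, s) == 4
assert 0 + (0 + s + 0) * II >= 4        # Equation (3) holds with p = 0
assert 0 + (0 + (s - 1) + 0) * II < 4   # ...and fails for any smaller skip
assert min_skip(latency=1, rdist=0, omega=3) == -2  # large omega: s may be negative
```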

[Figure 4 panels: a) Dependence graph (ld0 → m1 → a2 → st3 via vr0, vr1, vr2; loop-carried backward edge a2 → m1 with ω_{2,1} = 3; underlying cycle A); b) MRT (II = 2) with load ld0, mult m1, add a2, store st3; c) Input: rdist, latencies, and minimum skip factors: rdist_{0,1} = 0, l_{0,1} = 1, s_{0,1} = 1; rdist_{1,2} = 0, l_{1,2} = 4, s_{1,2} = 2; rdist_{2,1} = 0, l_{2,1} = 1, s_{2,1} = −2; rdist_{2,3} = 1, l_{2,3} = 1, s_{2,3} = 0; d) Underlying cycle constraint A: p_{1,2} + p_{2,1} = 0; e) Feasible stage schedule with p_{0,1} = p_{1,2} = p_{2,1} = p_{2,3} = 0.]
Figure 4: Computing the underlying cycle constraint (Example 3).

Example 3 Our third kernel, illustrating loop-carried dependences, is y[i] = x[i] * y[i-3] + a, where the value of x[i] is read from memory, multiplied by y[i-3], incremented by a, and stored in y[i], as shown in the dependence graph of Figure 4a. The backward (bold) edge from the add2 operation to the mult1 operation represents a loop-carried dependence, where the result of the add2 operation is reused by the mult1 operation three iterations later. Its dependence distance is thus ω_{2,1} = 3.

The row distance rdist_{i,j} associated with edge (i,j) can be computed using Equation (1). Using rdist_{i,j} and the latency l_{i,j} associated with edge (i,j), the minimum skip factor s_{i,j} can be computed using Equation (4). These input constants are shown in Figure 4c for the MRT-schedule shown in Figure 4b. Using these skip factors and computing the δ value (0, according to Equation (7)) for a counter-clockwise traversal of the underlying cycle A, we obtain the underlying cycle constraint shown in Figure 4d. A stage schedule satisfying this constraint is illustrated in Figure 4e.
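The input constants of Figure 4c can be checked mechanically. The sketch below (plain Python, with no scheduler infrastructure assumed) recomputes the skip factors with the skip-factor formula summarized later in Figure 6, then evaluates δ_A and the right-hand side of cycle constraint (6); the sign convention used in the comments (+1 for an edge traversed in its own direction) is our reading of the example, not a definition quoted from this section:

```python
import math

II = 2
# Input constants transcribed from Figure 4c; per edge (src, dst):
# (row distance rdist, latency l, dependence distance omega).
edges = {
    ("ld0", "m1"): (0, 1, 0),
    ("m1", "a2"): (0, 4, 0),
    ("a2", "m1"): (0, 1, 3),   # loop-carried edge, omega_{2,1} = 3
    ("a2", "st3"): (1, 1, 0),
}

def skip(edge):
    # Minimum skip factor: s = ceil((l - rdist) / II) - omega  (Equation (4))
    rdist, l, omega = edges[edge]
    return math.ceil((l - rdist) / II) - omega

# Underlying cycle A traverses edges (m1,a2) and (a2,m1) each in its
# own direction, so both carry sign +1 in Equations (6) and (7).
cycle = [("m1", "a2"), ("a2", "m1")]
delta = sum(edges[e][0] for e in cycle) // II        # Equation (7)
rhs = -sum(skip(e) for e in cycle) - delta           # RHS of Equation (6)

print({e: skip(e) for e in edges})   # the s values of Figure 4c
print(delta, rhs)                    # delta_A = 0; constraint A: p_{1,2} + p_{2,1} = 0
```

The recomputed skip factors (1, 2, −2, 0) and the zero right-hand side match Figures 4c and 4d.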

Example 4 Figures 5a and 5b, respectively, show the dependence graph and stage schedule with minimum integral MaxLive for the kernel y[i] = (x[i]^4 + x[i] + a)/(x[i] + b), which will be used to illustrate our algorithms below. By postponing the schedule of the add2 and add3 operations from times 3 and 2 to times 9 and 8, respectively, the register requirements decrease from 15 to 13. As we will see in the next section, the schedule presented in Figure 5b also achieves minimum fractional MaxLive, thus resulting in the minimum register requirements for this kernel, MRT-schedule, and set of functional unit latencies.

We can use Equation (4) to compute all the s_{i,j} values. The resulting skip factors are shown in Figure 5a. Using these skip factors and computing the δ values according to Equation (7), as shown in Figure 5a for counter-clockwise traversals of the underlying cycles A and B, the constraints for this kernel correspond to the following set of underlying-cycle constraints, from Equation (6):

    Constraint A: p_{0,1} + p_{1,4} + p_{4,5} − p_{2,5} − p_{0,2} = −2
    Constraint B: p_{0,2} + p_{2,5} + p_{5,6} − p_{3,6} − p_{0,3} = 0    (8)

It is sufficient to introduce one constraint per elementary cycle of the underlying graph, because the constraint of any arbitrary (possibly complex) cycle is simply a linear combination of the constraints of its elementary cycles. For example, the constraint associated with the outer cycle of Figure 5, as computed from Equation (6) with δ = 1 from Equation (7),

    p_{0,1} + p_{1,4} + p_{4,5} + p_{5,6} − p_{3,6} − p_{0,3} = −2

can be obtained by simply adding the constraints for cycles A and B. Therefore, satisfying constraints A and B will necessarily satisfy the constraint associated with the outer cycle.

[Figure 5 panels: (a) the dependence graph of Example 4 (ld0, m1, a2, a3, m4, a5, d6, st7 with virtual registers vr0–vr6), annotated with the minimum skip factors (s_{i,j}) and the delta factors δ_A = 0 and δ_B = 1; (b) the replicated MRT (II = 3), with rows containing {ld0, a2, m4}, {st7, a5, m1}, and {a3, d6}; (c) the dependence graph annotated with the additional skip factors (p_{i,j}). Functional unit assignment: add: a2, a3, a5; mult: m1, m4; div: d6; load: ld0; store: st7.]

Figure 5: Skip and delta factors for stage scheduling (Example 4).

Although any p_{i,j} values that satisfy these constraints result in a valid stage schedule, we are interested in the solution that minimizes integral MaxLive. In general, the integral part of a virtual register lifetime along a register data flow edge (i,j) is directly proportional to the sum of the minimum and additional skip factors and the dependence distance along that edge. As a result, we can express the integral part of the lifetime from operation i to operation j, ω_{i,j} iterations later, as:

    int_{i,j} = s_{i,j} + p_{i,j} + ω_{i,j}    (9)

The integral part of the lifetime associated with vr_i corresponds to the maximum integral part among all the outgoing register data flow edges of operation i:

    int_i = max_{∀j s.t. (i,j)∈E_reg} (ω_{i,j} + s_{i,j} + p_{i,j})    (10)

Using the function of Equation (10), we compute the integral MaxLive of the kernel of Example 4 as the sum of the lifetimes over all virtual registers:

    max(p_{0,1}, p_{0,2} + 1, p_{0,3}) + p_{1,4} + p_{2,5} + p_{3,6} + p_{4,5} + p_{5,6} + p_{6,7} + 5    (11)

We can now reduce the problem of finding a stage schedule that results in the minimum integral MaxLive to a well-known class of problems, solved by a linear programming (LP) solver [22]. Note, however, that (11) is not acceptable as the input to an LP-solver, because the objective function cannot contain any max functions. However, since we are minimizing the objective function, we can remove the max function by using some additional inequalities, called max constraints. Finally, we can remove the constant term in the objective function. The LP-solver input for the kernel and MRT presented in Figure 5 is therefore:

    Minimize:        m_0 + p_{1,4} + p_{2,5} + p_{3,6} + p_{4,5} + p_{5,6} + p_{6,7}
    Constraint A:    p_{0,1} + p_{1,4} + p_{4,5} − p_{0,2} − p_{2,5} = −2
    Constraint B:    p_{0,2} + p_{2,5} + p_{5,6} − p_{0,3} − p_{3,6} = 0
    Max constraints: p_{0,1} ≤ m_0;  p_{0,2} + 1 ≤ m_0;  p_{0,3} ≤ m_0    (12)

The result of the LP-solver is shown in Figure 5c, i.e. p_{0,2} = p_{0,3} = 2 and all other p_{i,j} = 0. We can verify that this solution satisfies the constraints of Equation (12) and yields m_0 = 3 with the other p_{i,j} in the objective function being 0, resulting in an integral MaxLive of m_0 + 5 = 8 registers. The corresponding stage schedule is shown in Figure 5b.
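Because system (12) is tiny, its optimum can also be confirmed by exhaustive search rather than an LP-solver. The sketch below is such a brute-force check (the search range 0–3 is an assumption, chosen large enough to contain the optimum; variable names follow the p_{i,j} of the text):

```python
from itertools import product

best = None
# p_{6,7} lies on no underlying cycle; by the corollary of Theorem 2 it
# is 0 in an optimum stage schedule, so it is omitted from the search.
for p01, p02, p03, p14, p25, p36, p45, p56 in product(range(4), repeat=8):
    if p01 + p14 + p45 - p02 - p25 != -2:    # Constraint A of (12)
        continue
    if p02 + p25 + p56 - p03 - p36 != 0:     # Constraint B of (12)
        continue
    m0 = max(p01, p02 + 1, p03)              # max constraints of (12)
    cost = m0 + p14 + p25 + p36 + p45 + p56  # objective of (12)
    if best is None or cost < best:
        best = cost

print(best + 5)   # integral MaxLive = objective + constant 5 = 8
```

The search confirms a minimum objective of 3, hence an integral MaxLive of 8; note that (12) admits alternative optima besides the p_{0,2} = p_{0,3} = 2 solution reported by the LP-solver.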

Algorithm 1 (MinBuf Stage Scheduler) The minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:

1. Compute all s_{i,j} using Equation (4) and search for all elementary cycles in the underlying graph of the dependence graph G{V, E_sched} [23].
2. If the underlying dependence graph is acyclic, the solution that produces the minimum integral MaxLive is obtained by setting the values of all p_{i,j} to zero.
3. Otherwise, build an underlying-cycle constraint for each elementary cycle of the underlying dependence graph using Equations (6) and (7). Derive the integral MaxLive objective function by summing Function (10) over i. Add max constraints for each multiple-use virtual register.
4. Solve the system of constraints to minimize the integral MaxLive by using an LP-solver. The solution defines the p_{i,j} values that result in the minimum integral MaxLive. Note that the solver may fail to find any solution because of some critical recurrence cycles in the dependence graph. In this case, the given MRT has no solution and a new MRT-schedule must be generated, possibly with a different II.

We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second theorem validates the solution and the linear solution time for the case of an acyclic underlying dependence graph.

Theorem 1 The solution of the minimum integral MaxLive LP-problem as defined in Algorithm 1 results in the minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. This solution is guaranteed to be integer valued.

Proof: We have already shown that valid stage schedules must satisfy the s_{i,j} and the underlying-cycle constraints. Therefore, this set of equations defines the space of feasible schedules. Since the underlying-cycle constraints, the max constraints, and the objective function (integral MaxLive) are linear, we can use an LP-solver to find a schedule that minimizes integral MaxLive. Since the solution, namely the set of p_{i,j} values, must be integer valued, we must show that the solution of the LP-solver is guaranteed to be integer valued for any dependence graph and MRT-schedule. Chaudhuri et al. [24] have shown that the linear-programming formulation of the scheduling problem with infinite resources (Assignment and Timing Constraints in [24]) is guaranteed to result in integer solutions. Our problem formulation is derived by removing their assignment constraints and by replacing their timing constraints with the underlying constraints, which are obtained by summing their timing constraints associated with the edges of an underlying cycle. Since eliminating or summing constraints preserves the integer property [22, pp. 540-541], the linear-programming formulation of the minimum integral MaxLive problem is also guaranteed to result in integer solutions. □

Theorem 2 For a dependence graph with an acyclic underlying dependence graph, the minimum integral MaxLive for a given kernel, MRT, and set of latencies is found in a time that is linear in the number of edges in the dependence graph. The minimum integral MaxLive is found by simply setting all p_{i,j} values to zero.

Proof: Consider applying the general form of Algorithm 1 (Steps 1, 3, and 4) to an acyclic underlying dependence graph. There are no underlying cycles, and hence no underlying-cycle constraints, in the linear program problem formulation, e.g. see (12). The max constraints all have the form p_{i,j} ≤ m_i + c, where m_i and p_{i,j} are to be found and c is a constant. The objective is to minimize a linear combination of the m_i and p_{i,j}, all of which have non-negative coefficients. Consequently the solution cost is monotonically non-decreasing in the m_i and p_{i,j}. Setting all p_{i,j} = 0, their minimum possible value, also allows the m_i to have their minimum feasible value and minimizes the objective function. Therefore, Steps 3 and 4 take no time and Step 2 of Algorithm 1 can be used instead. The solution time for Algorithm 1 is thus reduced to the time for Step 1. The s_{i,j} are each computed in constant time using Equation (4). Since there is one s_{i,j} parameter per edge in the dependence graph, the solution time is linear in the number of edges. □

We conclude this section by contrasting the complexity of our stage scheduling model (summarized in Figure 6) with the modulo scheduling model presented by Govindarajan et al. [17], where both models minimize integral MaxLive for machines with finite resources, but the former assumes a given MRT and the latter does not. The most important difference is that the stage scheduling model can be solved with an LP-solver in polynomial time complexity instead of an integer linear programming solver in exponential time complexity. Furthermore, the stage scheduling model has a vastly smaller number of constraints. For example, all the resource constraints are resolved prior to stage scheduling during the MRT-scheduling phase, and therefore are ignored in the stage scheduling model. Also, most of the dependence constraints present in the modulo scheduling model disappear in the stage scheduling model, as most of them are resolved statically by selecting appropriate minimum skip factors (s_{i,j}). The only remaining form of dependence constraints appears as underlying cycle constraints. Similarly, most of the constraints required to compute the register requirements associated with a solution are also computed statically in the stage scheduling model. The only p_{i,j} that actually must be accounted for in the objective function are the ones with operation i in an underlying cycle; a corollary of Theorem 2 ensures that all other p_{i,j} will be 0 in an optimum stage schedule. As a result, the stage scheduling model results in solution times that are significantly smaller than those of the modulo scheduling model.

5 Stage Scheduling for Minimum MaxLive

In this section, we extend the previous stage scheduling algorithm to consider both the fractional and the integral register requirements. We first quantify the fractional MaxLive associated with a kernel, an MRT, and a stage schedule. Then, we investigate the interaction between the fractional MaxLive and the search for an optimal stage schedule. We conclude the section by presenting an algorithm that minimizes the total MaxLive among all stage schedules for a given MRT.

When computing the fractional part of a lifetime, we must determine which is the last-use operation of a virtual register. In general, the last-use operation of vr_i is defined as an operation that maximizes the following expression:

    rdist_{i,j} + (ω_{i,j} + s_{i,j} + p_{i,j}) · II    ∀j s.t. (i,j) ∈ E_reg    (13)

We introduce two terms related to the last-use operation of vr_i:

    last_i:           the operation that last uses vr_i, i.e. the operation j that maximizes Equation (13).
    rdist_{i,last_i}: the distance from the row of operation i to the row of operation last_i, possibly in the next instance of the MRT, i.e. rdist_{i,last_i} = dist(row_i, row_{last_i}).

The distance rdist_{i,last_i} + 1 precisely defines the length of the fractional part, i.e. the number of rows covered by the fractional part of the lifetime associated with vr_i. Thus, we know that in our virtual register model the fractional part of vr_i contributes to the register requirement of a loop iteration for rdist_{i,last_i} + 1 cycles, starting from the row in which vr_i is first defined, namely row_i, up through row_{last_i} (with wraparound). As a result, we may express the fractional part of vr_i in row r as:

    frac_{r,i} = 1 if dist(row_i, r) < dist(row_i, row_{last_i}) + 1, and 0 otherwise    (14)

Using Equation (14) for each virtual register, we may write the total contribution of the fractional parts for a given row r as:

    Σ_{i s.t. (i,last_i)∈E_reg} frac_{r,i}    (15)
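As a concrete illustration of Equations (13)-(15), the sketch below evaluates the last-use selections and the per-row fractional contributions for Example 3 with all p_{i,j} = 0 (the per-edge constants are derived from Figure 4; the resulting numbers are an illustration of the formulas, not values quoted in the paper):

```python
II = 2
# Per virtual register of Example 3: its defining row, and for each use
# the pair (rdist to the use, integral lifetime omega + s + p on that edge).
vrs = {
    "vr0": (0, [(0, 1)]),          # ld0 -> m1
    "vr1": (0, [(0, 2)]),          # m1 -> a2
    "vr2": (0, [(0, 1), (1, 0)]),  # a2 -> m1 (loop-carried), a2 -> st3
}

def last_use(uses):
    # The use maximizing rdist + (omega + s + p) * II  (Equation (13))
    return max(uses, key=lambda u: u[0] + u[1] * II)

frac = [0] * II                     # per-row totals, Equations (14)-(15)
for row_i, uses in vrs.values():
    rdist_last, _ = last_use(uses)
    for k in range(rdist_last + 1): # rows covered by the fractional part
        frac[(row_i + k) % II] += 1

print(frac)   # contribution of the fractional parts in each MRT row
```

Here the loop-carried use by m1 dominates the last-use choice for vr2, and every fractional part lands in row 0.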

    Minimize:
        MaxBuf = Σ_{i∈V} m_i
    Subject to:
        Underlying cycles:  Σ_{(i,j)∈u} sign_{i,j}(u) (s_{i,j} + p_{i,j}) = −δ_u    ∀u ∈ U_G
        Max:                ω_{i,j} + s_{i,j} + p_{i,j} ≤ m_i    ∀(i,j) ∈ E_reg
    Definitions:
        rdist_{i,j} = (row_j − row_i) mod II
        δ_u = (1/II) Σ_{(i,j)∈u} sign_{i,j}(u) rdist_{i,j}
        s_{i,j} = ⌈(l_{i,j} − rdist_{i,j}) / II⌉ − ω_{i,j}
        p_{i,j}: integer, ≥ 0

Figure 6: Stage scheduling model for minimum integral MaxLive (minimum buffer).

    Machine:
        l_{i,j}                  latency between operation i and operation j.
    Graph:
        G = {V, E_sched, E_reg}  dependence graph with scheduling edge set and register data flow edge set.
        U_G                      set of underlying cycles in the underlying graph of the dependence graph G{V, E_sched}.
        ω_{i,j}                  dependence distance (in iterations) between operation i and operation j.
    MRT schedule:
        row_i                    row number in which operation i is scheduled.
        II                       initiation interval.
    Results:
        s_{i,j}                  minimum (integer) skip factor necessary to hide the latency l_{i,j} of the scheduling edge (i,j).
        p_{i,j}                  additional (non-negative integer) skip factor used to postpone operation j further.

Table 1: Variables for the stage scheduling model.

[Figure 7 panels: (a) the original dependence graph of Example 4; (b) the MRT (II = 3), with rows {ld0, a2, m4}, {st7, a5, m1}, and {a3, d6}; (c) the fractional parts of vr0 through vr6, with three cases for vr0 depending on whether m1, a2, or a3 is the last use of vr0; (d) the dependence graph with the additional constraints.]

Figure 7: Fractional part of the lifetimes (Example 4).

where the total contribution sums the fractional parts for all virtual registers. Since the fractional MaxLive is determined from the maximum number of live virtual registers in any single row, we can write the total MaxLive for a given kernel, MRT, and stage schedule as:

    R_MRT = Σ_{i s.t. (i,last_i)∈E_reg} (ω_{i,last_i} + s_{i,last_i} + p_{i,last_i}) + max_{r=0,...,II−1} Σ_{i s.t. (i,last_i)∈E_reg} frac_{r,i}    (16)

where the first and second terms correspond, respectively, to the integral and the fractional MaxLive. We can now formalize the interaction between the fractional and the integral part of the register requirements. Consider a loop iteration with a virtual register, vr_i, used by several operations. After having found a stage schedule with minimum integral MaxLive, using Algorithm 1, we can determine which of the use operations of vr_i is scheduled last, i.e. which operation x maximizes Equation (13). We know that Algorithm 1 produces schedules with minimum integral MaxLive; however, there is no guarantee that the schedules produced also minimize the fractional parts of the register requirements. For example, it is possible that fractional MaxLive could be decreased by scheduling operation y instead of x as a last-use operation of vr_i. Note that we need only consider vr_i if it has multiple use operations and at least one of them is in an underlying cycle. Consider an operation y that uses vr_i and that reduces fractional MaxLive if scheduled last instead of operation x. The question now is the following: "Is there a stage schedule that schedules operation y as last-use operation of vr_i, thus reducing fractional MaxLive, without increasing total MaxLive?" To answer this question, we simply force operation y to be scheduled after operation x by adding new scheduling constraints in the problem formulation, i.e. additional scheduling edges in the dependence graph. We can show that we must introduce at least one new constraint for each operation with larger fractional part than operation y, i.e.

    ω_{i,x} + s_{i,x} + p_{i,x} < ω_{i,y} + s_{i,y} + p_{i,y}    ∀x s.t. (i,x) ∈ E_reg and rdist_{i,x} > rdist_{i,y}    (17)

In general, we must consider each combination of such y operations (with no more than one such y for each vr) and investigate whether the reduction in fractional MaxLive results in a reduced total MaxLive for any combination, and if so, choose the combination that yields the largest reduction. To illustrate this process, consider Example 4. As shown in Figures 7a and 7b, three operations use vr0: mult1, add2, and add3, scheduled in rows 1, 0, and 2, respectively. The fractional parts associated with Example 4 are shown in Figure 7c, with three different cases for vr0 depending on which of the three operations is the last use of vr0. By counting the number of fractional parts for each row and taking the maximum number over the three rows, we notice that the fractional part is 6, unless add2 is the last-use operation of vr0, in which case the fractional part is 5. When searching for a stage schedule with minimum integral MaxLive, in Figure 5, we initially obtained a schedule with an integral MaxLive of 8 and a fractional MaxLive of 6, as add2 was scheduled II cycles earlier and thus add3 was the last-use operation of vr0. We then investigated whether there was a stage schedule that would schedule the add2 operation last to decrease the fractional MaxLive by 1 without increasing the integral MaxLive. Thus, we introduced two additional constraints, as presented in Equation (17), resulting in the two additional (bold) scheduling edges in the dependence graph shown in Figure 7d. Solving for a stage schedule with minimum integral MaxLive on this new dependence graph also results in a schedule with an integral MaxLive of 8, thus achieving a lower fractional MaxLive without increasing integral MaxLive. The resulting stage schedule is the one shown in Figure 5.

Algorithm 2 (MinReg Stage-Scheduler) The minimum total MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:

1. Use Algorithm 1 to compute the minimum integral MaxLive. We refer to this resulting stage-schedule as the base solution. In case Algorithm 1 fails to produce a stage schedule, because of some critical recurrence cycles in the dependence graph, a new MRT-schedule must be generated, possibly with a different II.
2. If the underlying dependence graph is acyclic, the base solution results in the minimum MaxLive.
3. Otherwise, compute the total MaxLive associated with the base solution using Equation (16) and determine all sets of additional constraints, as defined in Equation (17), that may decrease the fractional MaxLive. Interesting sets to be evaluated consist of all possible forcing combinations of one attractive last-use for some or all virtual registers whose fractional lifetime can be decreased.
4. Use Algorithm 1 repeatedly to compute the minimum integral MaxLive for the base system augmented by each of the sets of additional constraints, in turn. The schedule among these that achieves the smallest total MaxLive is optimum.
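The enumeration of Steps 3 and 4 can be sketched as follows. This is a toy skeleton only: both helper functions are stubs wired to Example 4's numbers rather than calls into a real LP-solver, and the helper names, the candidate list, and the stub values are all assumptions made for illustration:

```python
from itertools import product

def min_integral_maxlive(forced):
    """Stand-in for Algorithm 1 (the LP step): minimum integral MaxLive
    given the forcing edges, or None if infeasible. Stubbed: for Example 4
    both the base system and the add2-forced variant achieve 8."""
    return 8

def total_maxlive(integral, forced):
    """Stand-in for Equation (16): integral part plus the fractional
    MaxLive implied by the forced last-use choices (Example 4 numbers:
    fractional MaxLive is 5 when add2 is forced last, 6 otherwise)."""
    return integral + (5 if forced == ("a2",) else 6)

# Step 3: candidate last-use forcings, at most one per multi-use virtual
# register. In Example 4, vr0 is the only multi-use register and forcing
# add2 last is the only attractive alternative (Figure 7c).
candidates = [[None, "a2"]]          # None = keep the base last-use

best = None
for combo in product(*candidates):   # Step 4: try each forcing combination
    forced = tuple(c for c in combo if c is not None)
    integral = min_integral_maxlive(forced)
    if integral is None:             # infeasible augmented system
        continue
    best = min(best or 99, total_maxlive(integral, forced))

print(best)   # 13 = 8 integral + 5 fractional, as in the Example 4 discussion
```

With real helpers, `candidates` would hold one list per virtual register whose fractional lifetime can be decreased, so `product` enumerates exactly the forcing combinations described in Step 3.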

We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second for acyclic underlying graphs.

Theorem 3 Algorithm 2 produces the minimum MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. This solution is guaranteed to be integer valued.

Proof: Since we test all the combinations of last-use operations that potentially reduce the fractional MaxLive, and since Algorithm 1 produces the minimal integral MaxLive for the augmented constraint set, the solutions of Algorithm 2 and the original unaugmented solution must include the minimum achievable total MaxLive for this MRT. It is easy to see that the additional set of constraints as defined in Equation (17) also satisfies the guaranteed integer solution property for LP problems. Therefore, the solution produced by an LP-solver for each use of Algorithm 1 is guaranteed to be integer valued. □

While there is potentially an exponential number of additional constraints to test, this number in practice tends to be relatively small for several reasons. First, the number of tests is directly dependent on the number of cycles in the underlying dependence graph, which is usually a small number. Second, we have to consider only those cases that potentially reduce the fractional MaxLive. We do not need to consider use operations unless they may impact the MRT row that could govern the second term of Equation (16). Finally, we can use some inclusion properties to reduce the total number of tests.

Theorem 4 For dependence graphs with acyclic underlying graphs, the solution that minimizes the integral MaxLive also minimizes the total MaxLive for a given kernel, MRT, and set of latencies.

Proof: In general, the solution that minimizes the integral MaxLive may not be a solution with the minimal fractional register requirements. Recall, however, that the Algorithm 1 solutions for acyclic underlying graphs have all p_{i,j} = 0. Therefore, reducing a fractional lifetime for some vr_i requires setting some p_{i,j} to 1, increasing the integer lifetime for vr_i by 1. Therefore, reducing the fractional register requirements for vr_i cannot decrease the MaxLive of this virtual register. As the lifetime of each vr is independent of the lifetime of any other vr for an acyclic underlying graph, it is never advantageous to force a different last-use operation unless the total lifetime of the vr is decreased. Thus, we cannot reduce the register requirements of the base solution. □

6 Measurements

In this section, we investigate the register requirements of the two stage-scheduling algorithms of this paper. For purposes of comparison, we also present the register requirements of an efficient modulo scheduling algorithm that presently does not attempt to minimize the register requirements. Unfortunately, we are unable to provide a comparison with Huff's scheduling algorithm [11], since his machine model differs slightly from ours and his latest scheduler, presented in [11], was not available to us.

MinReg Stage-Scheduler: This scheduler minimizes MaxLive over all valid modulo schedules that share a given MRT, using Algorithm 2, which was presented in Section 5. The resulting schedule has the lowest achievable register requirements for the given machine, loop iteration, and MRT.

MinBuf Stage-Scheduler: This scheduler minimizes integer MaxLive over all valid modulo schedules that share a given MRT, using Algorithm 1, which was presented in Section 4. The resulting schedule has the lowest achievable buffer requirements for the given machine, loop iteration, and MRT. Buffers must be reserved for a time interval that is a multiple of II, whereas registers may be reserved for arbitrary time periods. We derive and use in our comparisons the actual register requirements associated with these MinBuf stage schedules.

Iterative Modulo Scheduler [25]: This scheduler has been designed to deal efficiently with realistic machine models while producing schedules with near optimal steady-state throughput. Experimental findings show that this algorithm requires the scheduling of only 59% more operations than does acyclic list scheduling. At the same time, it results in schedules that are optimal in II for 96% of the loops in their benchmark [25]. In its current form, the scheduler does not attempt to minimize the register requirements of its schedules; however, the register requirements of its schedules may be reduced significantly by the simple heuristics presented in [15].

                        Operations   Operations   Underlying   Register   Scheduling
                                     in cycles    cycles       edges      edges
    minimum             2            0            0            1          1
    average             17.5         10.2         3.0          21.4       22.5
    maximum             161          158          66           220        232

Table 2: Characteristics of the benchmark suite.

[Figure 8 plot: fraction of loops scheduled (Y-axis, 0-1) versus MaxLive (X-axis, 0-160) for the Schedule-Independent Lower Bound, the MinReg Stage Scheduler, and the Iterative Modulo Scheduler.]
Figure 8: Register requirements of the benchmark suite. We use a benchmark of loops obtained from the Perfect Club [26], SPEC-89 [27], and the Livermore Fortran Kernels [28]. Our benchmark consists exclusively of innermost loops with no early exits, no procedure calls, and fewer than 30 basic blocks, as compiled by the Cydra 5 Fortran77 compiler [29]. The input to the three scheduling algorithms consists of the Fortran77 compiler intermediate representation after load-store elimination, recurrence back-substitution, and IF-conversion. Our benchmark suite consists of the 1327 loops successfully modulo scheduled by the Cydra 5 Fortran77 compiler. The MinReg Stage-Scheduler found an optimal solution for 1296 of the 1327 loops. For the remaining 31 loops, the scheduler exceeded the 512 augmented problem limit (Algorithm 2, Step 4) used in our measurements. Table 2 summarizes the principal characteristics of the benchmark suite, reporting the minimum, average, and maximum value for each entry. On average, more than 37% of the operations in a loop iteration belong to an underlying cycle and more than 62% of the operations in a loop iteration belong to an underlying cycle or self-loop in the underlying dependence graph. The machine model used in these experiments corresponds to the Cydra 5 machine. This choice was motivated by the availability of quality code for this machine. Also, the resource requirements of the Cydra 5 machine are complex [14], thus stressing the importance of good and robust scheduling algorithms. In particular, the machine con guration is the one used in [25] with 7 functional units (2 memory port units, 2 address generation units, 1 FP adder unit, 1 FP multiplier unit, and 1 branch unit). The Iterative Modulo Scheduler was also used to generate the MRT input for the MinReg and MinBuf Stage-Schedulers. 
We rst investigate the register requirements of the MinReg Stage Scheduler which results in the lowest register requirements over all valid modulo schedules that share a common MRT. Figure 8 presents the fraction of the loops in the benchmark suite that can be scheduled for a machine with a given number of registers without spilling and without increasing II. In this graph, the X-axis represents MaxLive and the Y-axis represents the fraction of loops scheduled on a machine with up to MaxLive physical registers. The \Schedule Independent Lower Bound" curve corresponds to the bound presented in [11] and is representative of the register requirements of a machine with unlimited resources. There is a signi cant gap between the MinReg Stage Scheduler curve and the Lower Bound curve which we believe is caused by two factors. First, the MinReg Stage Scheduler searches for a schedule with the lowest register requirements only among the modulo schedules that share the given MRT. Therefore, when the MRT given to the stage scheduler is suboptimal, the stage scheduler will nd a local minimum that may be signi cantly larger than 17

1

Fraction of Loops Scheduled

0.9 0.8 0.7 0.6 0.5 0.4 0.3

MinBuf Stage Scheduler

0.2

Iterative Modulo Scheduler

0.1 0 0

4

8

12

16

20 24 28 32 36 Additional MaxLive

40

44

48

52

Figure 9: Additional register requirements over MinReg Stage-Scheduler. the absolute minimum.2 The second factor is that the schedule-independent lower bound may be signi cantly too optimistic due to the complexity of some of the loops in the benchmark suite and the fact that this lower bound ignores the resource constraints of the machine. The lowest curve presents the register requirements of the Iterative Modulo Scheduler, which presently does not attempt to minimize the register requirements. We see that stage scheduling for minimum register requirements signi cantly reduces the register requirements of the loops in the benchmark suite considered. In general, stage scheduling will make a signi cant di erence for schedules in which the number of stages and the number of operations not on the critical path are not too small. Trying di erent MRTs for a given loop iteration is likely to become increasingly important as II increases. The gap between these last two curves also indicates the degree of improvement that might be achieved by stage scheduling heuristics. average maximum standard deviation

MinBuf Stage Scheduler MinReg Stage Scheduler 112.2 ms 23.5 sec 17.6 sec 108.3 min 816.6 ms 268.4 sec

Table 3: Computation time on a SPARC-20. Figure 9 illustrates the number of additional registers needed when the other scheduling algorithms are used instead of the MinReg Stage-Scheduler. The upper curve corresponds to the additional register requirements of the MinBuf Stage-Scheduler, which nds a stage-schedule with no additional register requirements in 94% of the loops. The MinBuf Stage-Scheduler schedules 98% of the loops with no more than 1 additional register but needs up to 7 additional registers for the remaining 2%. The Iterative Modulo Scheduler results in a stage-schedule with no additional register requirements in 42% of the loops, a surprisingly high number for a scheduler that does not attempt to minimize the register requirements. This result may be partially Experimental evidence on a subset of this benchmark suite, consisting of all the loops with no more than 12 operations and up to 5, for which an optimal modulo schedule was sought over all MRTs, showed, however, that the di erence between the local and absolute minimum was surprisingly small[20]. This result may not hold for larger loops. 2

II

18

explained by the fact that this scheduler does attempt to minimize the length of a schedule, which generally results in a stage-schedule with low register requirements along the critical path of that schedule. The computation time for the MinReg and MinBuf stage schedulers for the average and the maximum loop and the standard deviation are given in Table 3.

7 Conclusions

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops. It results in high performance code, but increases the register requirements. As the trend toward higher concurrency continues, whether due to using and exploiting machines with faster clocks and deeper pipelines, wider issue, or a combination of both, the register requirements will increase even more. As a result, scheduling algorithms that reduce register pressure while scheduling for high throughput are increasingly important. This paper presents an approach that schedules the operations of a loop iteration to achieve the minimum register requirements for a given modulo reservation table. This method finds a stage-schedule with minimum register requirements for general dependence graphs on machines with finite resources. When known to be optimal, a linear-time algorithm is used; otherwise, a linear programming approach is used. We can also quickly determine when the faster algorithm is applicable. This paper demonstrates by example that selecting a good stage-schedule among all schedules that share the same MRT can result in lower register requirements. Measurements on a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that the register requirements decrease by 24.5% on average when applying the optimal stage scheduler to the MRT-schedules of a register-insensitive modulo scheduler. Though the general algorithm may be too slow for general use in a production compiler, it may be extremely useful in special situations. This algorithm is also useful in evaluating the performance of register-sensitive modulo scheduling heuristics.

Acknowledgment

This work was supported in part by the Office of Naval Research under grant number N00014-93-1-0163 and by Hewlett-Packard. The authors would like to thank B. Ramakrishna Rau for his many insights, useful suggestions, and for providing the input data set.

References

[1] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7:9–50, 1993.

[2] P. Y. Hsu. Highly Concurrent Scalar Processing. PhD thesis, University of Illinois at Urbana-Champaign, 1986.

[3] B. R. Rau, C. D. Glaeser, and R. L. Picard. Efficient code generation for horizontal architectures: Compiler techniques and architecture support. Proceedings of the Ninth Annual International Symposium on Computer Architecture, pages 131–139, 1982.

[4] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. Proceedings of the ACM SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 318–328, June 1988.

[5] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170–179, December 1992.

[6] N. J. Warter. Modulo Scheduling with Isomorphic Control Transformations. PhD thesis, University of Illinois at Urbana-Champaign, 1994.

[7] P. P. Tirumalai, M. Lee, and M. S. Schlansker. Parallelization of loops with exits on pipelined architectures. Proceedings of Supercomputing '90, pages 200–212, November 1990.

[8] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. Proceedings of the ACM SIGPLAN'92 Conference on Programming Language Design and Implementation, pages 283–299, June 1992.


[9] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson. Register requirements of pipelined processors. Proceedings of the International Conference on Supercomputing, pages 260–271, July 1992.

[10] C. Eisenbeis and D. Windheiser. Optimal software pipelining in presence of resource constraints. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, August 1993.

[11] R. A. Huff. Lifetime-sensitive modulo scheduling. Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pages 258–267, June 1993.

[12] J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. Proceedings of the Third Annual International Symposium on Computer Architecture, pages 159–164, 1976.

[13] C. Eisenbeis, W. Jalby, and A. Lichnewsky. Squeezing more performance out of a Cray-2 by vector block scheduling. Proceedings of Supercomputing '88, pages 237–246, November 1988.

[14] G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: Architecture and implementation. The Journal of Supercomputing, 7:143–180, 1993.

[15] A. E. Eichenberger and E. S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180–191, November 1995.

[16] Q. Ning and G. R. Gao. A novel framework of register allocation for software pipelining. Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29–42, 1993.

[17] R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85–94, November 1994.

[18] J. Wang, A. Krall, and M. A. Ertl. Decomposed software pipelining with reduced register requirement. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, June 1995.

[19] Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, 1994.

[20] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Optimum modulo schedules for minimum register requirements. Proceedings of the International Conference on Supercomputing, pages 31–40, July 1995.

[21] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez. Hypernode reduction modulo scheduling. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 350–360, November 1995.

[22] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley, New York, 1988.

[23] K. Paton. An algorithm for finding a fundamental set of cycles of a graph. Communications of the ACM, 12(9):514–518, September 1969.

[24] S. Chaudhuri, R. A. Walker, and J. E. Mitchell. Analysing and exploiting the structure of the constraints in the ILP approach to the scheduling problem. IEEE Transactions on Very Large Scale Integration Systems, 2(4):456–471, December 1994.

[25] B. R. Rau. Iterative Modulo Scheduling: An algorithm for software pipelining loops. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63–74, November 1994.

[26] M. Berry et al. The Perfect Club Benchmarks: Effective performance evaluation of supercomputers. The International Journal of Supercomputer Applications, 3(3):5–40, Fall 1989.

[27] J. Uniejewski. SPEC Benchmark Suite: Designed for today's advanced system. SPEC Newsletter, Fall 1989.

[28] F. H. McMahon. The Livermore Fortran Kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, California, 1986.

[29] J. C. Dehnert and R. A. Towle. Compiling for the Cydra 5. The Journal of Supercomputing, 7:181–227, 1993.
