[Figure: panel (b) of Figure 5 — replicated MRT (II = 3) with virtual registers vr1-vr6 and resource columns: add: a2, a3, a5; mult: m1, m4; div: d6; load: ld0; store: st7.]
Figure 5: Skip and delta factors for stage scheduling (Example 4).

Although any p_{i,j} values that satisfy these constraints result in a valid stage schedule, we are interested in the solution that minimizes integral MaxLive. In general, the integral part of a virtual register lifetime along a register data flow edge (i, j) is directly proportional to the sum of the minimum and additional skip factors and the dependence distance along that edge. As a result, we can express the integral part of the lifetime from operation i to operation j, ω_{i,j} iterations later, as:

    int_{i,j} = s_{i,j} + p_{i,j} + ω_{i,j}    (9)

The integral part of the lifetime associated with vr_i corresponds to the maximum integral part among all the outgoing register data flow edges of operation i:

    int_i = max_{∀j s.t. (i,j) ∈ E_reg} (ω_{i,j} + s_{i,j} + p_{i,j})    (10)
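Equation (10) is simply a maximum over the outgoing register data flow edges of operation i. A minimal sketch in Python, with a hypothetical tuple encoding of the edges:

```python
def integral_part(out_edges):
    """Integral lifetime of vr_i per Equation (10): the maximum of
    omega + s + p over the outgoing register data flow edges of i.

    out_edges: list of (omega, s, p) tuples (hypothetical encoding).
    """
    return max(omega + s + p for omega, s, p in out_edges)

# Two outgoing edges: (omega=0, s=1, p=0) and (omega=1, s=0, p=2).
print(integral_part([(0, 1, 0), (1, 0, 2)]))  # prints 3
```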
Using the function of Equation (10), we compute the integral MaxLive of the kernel of Example 3 as the sum of the lifetimes over all virtual registers:

    max(p_{0,1}, p_{0,2} + 1, p_{0,3}) + p_{1,4} + p_{2,5} + p_{3,6} + p_{4,5} + p_{5,6} + p_{6,7} + 5    (11)

We can now reduce the problem of finding a stage schedule that results in the minimum integral MaxLive to a well-known class of problems, solved by a linear programming (LP) solver [22]. Note, however, that (11) is not acceptable as the input to an LP-solver, because the objective function cannot contain any max functions. Since we are minimizing the objective function, we can remove the max function by using some additional inequalities, called max constraints. Finally, we can remove the constant term in the objective function. The LP-solver input for the kernel and MRT presented in Figure 5 is therefore:

    Minimize:  m_0 + p_{1,4} + p_{2,5} + p_{3,6} + p_{4,5} + p_{5,6} + p_{6,7}
    Constraint A:  p_{0,1} + p_{1,4} + p_{4,5} − p_{0,2} − p_{2,5} = −2
    Constraint B:  p_{0,2} + p_{2,5} + p_{5,6} − p_{0,3} − p_{3,6} = 0
    Max Constraints:  p_{0,1} ≤ m_0;  p_{0,2} + 1 ≤ m_0;  p_{0,3} ≤ m_0    (12)
The result of the LP-solver is shown in Figure 5c, i.e. p_{0,2} = p_{0,3} = 2 and all other p_{i,j} = 0. We can verify that this solution satisfies the constraints of Equation (12) and yields m_0 = 3 with the other p_{i,j} in the objective function being 0, resulting in an integral MaxLive = m_0 + 5 = 8 registers. The corresponding stage schedule is shown in Figure 5b.
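Because the optimum of system (12) is integer valued, this tiny instance can also be checked without an LP-solver by exhaustive search over small integer skip factors. The sketch below verifies the quoted result for this example only; it is not the general method:

```python
from itertools import product

# Additional skip factors p01, p02, p03, p14, p25, p36, p45, p56, p67
# of system (12), searched over a small integer range.
best = None
for p01, p02, p03, p14, p25, p36, p45, p56, p67 in product(range(4), repeat=9):
    if p01 + p14 + p45 - p02 - p25 != -2:      # Constraint A
        continue
    if p02 + p25 + p56 - p03 - p36 != 0:       # Constraint B
        continue
    m0 = max(p01, p02 + 1, p03)                # max constraints, tight
    obj = m0 + p14 + p25 + p36 + p45 + p56 + p67
    best = obj if best is None else min(best, obj)

print(best, best + 5)  # prints 3 8: minimum objective, then integral MaxLive
```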
Algorithm 1 (MinBuf Stage Scheduler) The minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:
1. Compute all s_{i,j} using Equation (4) and search for all elementary cycles in the underlying graph of the dependence graph G{V, E_sched} [23].
2. If the underlying dependence graph is acyclic, the solution that produces the minimum integral MaxLive is obtained by setting the values of all p_{i,j} to zero.
3. Otherwise, build an underlying-cycle constraint for each elementary cycle of the underlying dependence graph using Equations (6) and (7). Derive the integral MaxLive objective function by summing Function (10) over i. Add max constraints for each multiple-use virtual register.
4. Solve the system of constraints to minimize the integral MaxLive by using an LP-solver. The solution defines the p_{i,j} values that result in the minimum integral MaxLive. Note that the solver may fail to find any solutions because of some critical recurrence cycles in the dependence graph. In this case, the given MRT has no solution and a new MRT-schedule must be generated, possibly with a different II.
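The acyclicity test behind Step 2 can be sketched with a union-find pass over the undirected (underlying) version of the dependence graph: an edge whose endpoints are already connected closes an underlying cycle. The graph encoding is hypothetical:

```python
def has_underlying_cycle(num_ops, edges):
    """True if the undirected version of the dependence graph
    contains a cycle. edges: iterable of (i, j) operation pairs."""
    parent = list(range(num_ops))

    def find(x):                      # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:                  # already connected: cycle closed
            return True
        parent[ri] = rj
    return False
```

For example, a chain 0-1-2 has no underlying cycle, but adding the edge (2, 0) creates one.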
We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second theorem validates the solution and the linear solution time for the case of an acyclic underlying dependence graph.
Theorem 1 The solution of the minimum integral MaxLive LP-problem as defined in Algorithm 1 results in the minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. This solution is guaranteed to be integer valued.

Proof: We have already shown that valid stage schedules must satisfy the s_{i,j} and the underlying-cycle
constraints. Therefore, this set of equations defines the space of feasible schedules. Since the underlying-cycle constraints, the max constraints, and the objective function (integral MaxLive) are linear, we can use an LP-solver to find a schedule that minimizes integral MaxLive. Since the solution, namely the set of p_{i,j} values, must be integer valued, we must show that the solution of the LP-solver is guaranteed to be integer valued for any dependence graph and MRT-schedule. Chaudhuri et al. [24] have shown that the linear-programming formulation of the scheduling problem with infinite resources (Assignment and Timing Constraints in [24]) is guaranteed to result in integer solutions. Our problem formulation is derived by removing their assignment constraints and by replacing their timing constraints with the underlying constraints, which are obtained by summing their timing constraints associated with the edges of an underlying cycle. Since eliminating or summing constraints preserves the integer property [22, pp. 540-541], the linear-programming formulation of the minimum integral MaxLive problem is also guaranteed to result in integer solutions. □
Theorem 2 For a dependence graph with an acyclic underlying dependence graph, the minimum integral MaxLive for a given kernel, MRT, and set of latencies is found in a time that is linear in the number of edges in the dependence graph. The minimum integral MaxLive is found by simply setting all p_{i,j} values to zero.
Proof: Consider applying the general form of Algorithm 1 (Steps 1, 3, and 4) to an acyclic underlying dependence graph. There are no underlying cycles, and hence no underlying-cycle constraints, in the linear program problem formulation, e.g. see (12). The max constraints all have the form p_{i,j} ≤ m_i + c, where m_i and p_{i,j} are to be found and c is a constant. The objective is to minimize a linear combination of the m_i and p_{i,j}, all of which have non-negative coefficients. Consequently the solution cost is monotonically non-decreasing in the m_i and p_{i,j}. Setting all p_{i,j} = 0, their minimum possible value, also allows the m_i to
have their minimum feasible value and minimizes the objective function. Therefore, Steps 3 and 4 take no time and Step 2 of Algorithm 1 can be used instead. The solution time for Algorithm 1 is thus reduced to the time for Step 1. The s_{i,j} are each computed in constant time using Equation (4). Since there is one s_{i,j} parameter per edge in the dependence graph, the solution time is linear in the number of edges. □

We conclude this section by contrasting the complexity of our stage scheduling model (summarized in Figure 6) with the modulo scheduling model presented by Govindarajan et al. [17], where both models minimize integral MaxLive for machines with finite resources, but the former assumes a given MRT and the latter does not. The most important difference is that the stage scheduling model can be solved with an LP-solver in polynomial time instead of an integer linear programming solver in exponential time. Furthermore, the stage scheduling model has a vastly smaller number of constraints. For example, all the resource constraints are resolved prior to stage scheduling during the MRT-scheduling phase, and therefore are ignored in the stage scheduling model. Also, most of the dependence constraints present in the modulo scheduling model disappear in the stage scheduling model, as most of them are resolved statically by selecting appropriate minimum skip factors (s_{i,j}). The only remaining form of dependence constraints appears as underlying-cycle constraints. Similarly, most of the constraints required to compute the register requirements associated with a solution are also computed statically in the stage scheduling model. The only p_{i,j} that actually must be accounted for in the objective function are the ones with operation i in an underlying cycle; a corollary of Theorem 2 ensures that all other p_{i,j} will be 0 in an optimum stage schedule.
As a result, the stage scheduling model results in solution times that are significantly smaller than those of the modulo scheduling model.
5 Stage Scheduling for Minimum MaxLive

In this section, we extend the previous stage scheduling algorithm to consider both the fractional and the integral register requirements. We first quantify the fractional MaxLive associated with a kernel, an MRT, and a stage schedule. Then, we investigate the interaction between the fractional MaxLive and the search for an optimal stage schedule. We conclude the section by presenting an algorithm that minimizes the total MaxLive among all stage schedules for a given MRT.

When computing the fractional part of a lifetime, we must determine which is the last-use operation of a virtual register. In general, the last-use operation of vr_i is defined as an operation that maximizes the following expression:

    rdist_{i,j} + (ω_{i,j} + s_{i,j} + p_{i,j}) · II    ∀j s.t. (i, j) ∈ E_reg    (13)

We introduce two terms related to the last-use operation of vr_i:

    last_i: the operation that last uses vr_i, i.e. the operation j that maximizes Equation (13).
    rdist_{i,last_i}: the distance from the row of operation i to the next row of operation last_i, possibly in the next instance of the MRT, i.e. rdist_{i,last_i} = dist(row_i, row_{last_i}).

The distance rdist_{i,last_i} + 1 precisely defines the length of the fractional part, i.e. the number of rows covered by the fractional part of the lifetime associated with vr_i. Thus, we know that in our virtual register model the fractional part of vr_i contributes to the register requirement of a loop iteration for rdist_{i,last_i} + 1 cycles, starting from the row in which vr_i is first defined, namely row_i, up through row_{last_i} (with wraparound). As a result, we may express the fractional part of vr_i in row r as:

    frac_{r,i} = 1 if dist(row_i, r) < dist(row_i, row_{last_i}) + 1; 0 otherwise    (14)

Using Equation (14) for each virtual register, we may write the total contribution of the fractional parts for a given row r as:

    Σ_{i s.t. (i,last_i) ∈ E_reg} frac_{r,i}    (15)
Minimize:

    MaxBuf = Σ_{i ∈ V} m_i

Subject to:

    Underlying Cycles:  Σ_{(i,j) ∈ u} sign_u(i,j) (s_{i,j} + p_{i,j}) = −Σ_{(i,j) ∈ u} sign_u(i,j) ω_{i,j} − (1/II) Σ_{(i,j) ∈ u} sign_u(i,j) rdist_{i,j}    ∀u ∈ U_G
    Max:  ω_{i,j} + s_{i,j} + p_{i,j} ≤ m_i    ∀(i, j) ∈ E_reg

Definitions:

    rdist_{i,j} = (row_j − row_i) mod II
    s_{i,j} = ⌈(l_{i,j} − rdist_{i,j}) / II⌉ − ω_{i,j}
    p_{i,j}: integer, ≥ 0

Figure 6: Stage scheduling model for minimum integral MaxLive (minimum buffer).
Machine:
    l_{i,j}    latency between operation i and operation j.
Graph:
    G = {V, E_sched, E_reg}    dependence graph with scheduling edge set and register data flow edge set.
    U_G    set of underlying cycles in the underlying graph of the dependence graph G{V, E_sched}.
    ω_{i,j}    dependence distance (in iterations) between operation i and operation j.
MRT schedule:
    row_i    row number in which operation i is scheduled.
    II    initiation interval.
Results:
    s_{i,j}    minimum (integer) skip factor necessary to hide the latency l_{i,j} of the scheduling edge (i, j).
    p_{i,j}    additional (non-negative integer) skip factor used to postpone operation j further.

Table 1: Variables for the stage scheduling model.
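Using the definitions in Figure 6 and Table 1, rdist and the minimum skip factor can be sketched as follows. Note that the ceiling formula for s_{i,j} is our reconstruction of Equation (4), which is not reproduced in this excerpt, so treat it as an assumption:

```python
import math

def rdist(row_i, row_j, II):
    """Row distance from operation i to operation j within the MRT."""
    return (row_j - row_i) % II

def min_skip(l_ij, omega_ij, row_i, row_j, II):
    """Minimum integer s_ij such that the def-to-use span
    rdist + II * (omega + s) covers the latency l_ij (assumed form)."""
    return math.ceil((l_ij - rdist(row_i, row_j, II)) / II) - omega_ij

# Latency 5, II = 3, rows 0 -> 1, same-iteration dependence (omega = 0):
# s = 2 gives a span of 1 + 3*2 = 7 >= 5, while s = 1 gives only 4.
print(min_skip(5, 0, 0, 1, 3))  # prints 2
```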
[Figure: (a) original dependence graph; (b) MRT (II = 3); (c) fractional parts, with three cases depending on whether a2, m1, or a3 is the last use of vr0; (d) dependence graph with additional constraints.]
Figure 7: Fractional part of the lifetimes (Example 4).

where the total contribution sums the fractional parts for all virtual registers. Since the fractional MaxLive is determined from the maximum number of live virtual registers in any single row, we can write the total MaxLive for a given kernel, MRT, and stage schedule as:

    R^{MRT} = Σ_{i s.t. (i,last_i) ∈ E_reg} (ω_{i,last_i} + s_{i,last_i} + p_{i,last_i}) + max_{r=0,...,II−1} Σ_{i s.t. (i,last_i) ∈ E_reg} [1 if dist(row_i, r) < dist(row_i, row_{last_i}) + 1; 0 otherwise]    (16)
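A direct transcription of Equations (13), (14), and (16), using a hypothetical dictionary encoding of the register data flow edges, might look like:

```python
def total_maxlive(II, row, reg_edges):
    """Total MaxLive per Equation (16).

    row: row[i] = MRT row of operation i.
    reg_edges: {i: [(j, omega, s, p), ...]} outgoing register data
    flow edges of each defining operation i (hypothetical encoding).
    """
    dist = lambda a, b: (b - a) % II
    integral = 0
    frac = [0] * II                      # fractional live count per row
    for i, uses in reg_edges.items():
        # The last use maximizes Equation (13).
        j, omega, s, p = max(
            uses,
            key=lambda e: dist(row[i], row[e[0]]) + (e[1] + e[2] + e[3]) * II)
        integral += omega + s + p
        # Fractional part covers rows row_i .. row_last (Equation (14)).
        for r in range(II):
            if dist(row[i], r) < dist(row[i], row[j]) + 1:
                frac[r] += 1
    return integral + max(frac)
```

For instance, with II = 3, a register defined in row 0 and last used in row 1 (integral part 0) plus one defined in row 2 and used in row 0 of a later stage (integral part 1) overlap in row 0, giving a total MaxLive of 1 + 2 = 3.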
where the first and second terms correspond, respectively, to the integral and the fractional MaxLive.

We can now formalize the interaction between the fractional and the integral part of the register requirements. Consider a loop iteration with a virtual register, vr_i, used by several operations. After having found a stage schedule with minimum integral MaxLive, using Algorithm 1, we can determine which of the use operations of vr_i is scheduled last, i.e. which operation x maximizes Equation (13). We know that Algorithm 1 produces schedules with minimum integral MaxLive; however, there is no guarantee that the schedules produced also minimize the fractional parts of the register requirements. For example, it is possible that fractional MaxLive could be decreased by scheduling operation y instead of x as the last-use operation of vr_i. Note that we need only consider vr_i if it has multiple use operations and at least one of them is in an underlying cycle. Consider an operation y that uses vr_i and that reduces fractional MaxLive if scheduled last instead of operation x. The question now is the following: "Is there a stage schedule that schedules operation y as the last-use operation of vr_i, thus reducing fractional MaxLive, without increasing total MaxLive?" To answer this question, we simply force operation y to be scheduled after operation x by adding new scheduling constraints in the problem formulation, i.e. additional scheduling edges in the dependence graph. We can show that we must introduce at least one new constraint for each operation with larger fractional
part than operation y, i.e.

    ω_{i,x} + s_{i,x} + p_{i,x} < ω_{i,y} + s_{i,y} + p_{i,y}    ∀x s.t. (i, x) ∈ E_reg and rdist_{i,x} > rdist_{i,y}    (17)
In general, we must consider each combination of such y operations (with no more than one such y for each vr) and investigate whether the reduction in fractional MaxLive results in a reduced total MaxLive for any combination, and if so, choose the combination that yields the largest reduction. To illustrate this process, consider Example 4. As shown in Figures 7a and 7b, three operations use vr0: mult1, add2, and add3, scheduled in rows 1, 0, and 2, respectively. The fractional parts associated with Example 4 are shown in Figure 7c, with three different cases for vr0 depending on which of the three operations is the last use of vr0. By counting the number of fractional parts in each row and taking the maximum over the three rows, we notice that the fractional part is 6 unless add2 is the last-use operation of vr0, in which case the fractional part is 5. When searching for a stage schedule with minimum integral MaxLive, in Figure 5, we initially obtained a schedule with an integral MaxLive of 8 and a fractional MaxLive of 6, as add2 was scheduled II cycles earlier and thus add3 was the last-use operation of vr0. We then investigated whether there was a stage schedule that would schedule the add2 operation last to decrease the fractional MaxLive by 1 without increasing the integral MaxLive. Thus, we introduced two additional constraints, as presented in Equation (17), resulting in the two additional (bold) scheduling edges in the dependence graph shown in Figure 7d. Solving for a stage schedule with minimum integral MaxLive on this new dependence graph also results in a schedule with an integral MaxLive of 8, thus achieving a lower fractional MaxLive without increasing integral MaxLive. The resulting stage schedule is the one shown in Figure 5.
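The enumeration of forcing constraints behind Equation (17) can be sketched as follows; the use-list encoding (row distances only) is hypothetical:

```python
def forcing_constraints(uses, y):
    """Pairs (x, y) for which Equation (17) must hold so that use y
    becomes the last use of vr_i.

    uses: {op: rdist} row distance of each use from the definition
    (hypothetical encoding); a use x must be forced after y whenever
    its row distance exceeds that of y.
    """
    return [(x, y) for x, rd in uses.items() if x != y and rd > uses[y]]

# vr0 is used by m1, a2, a3 at row distances 1, 0, 2; forcing a2 last
# requires constraints against both m1 and a3, as in Figure 7d.
print(forcing_constraints({"m1": 1, "a2": 0, "a3": 2}, "a2"))
```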
Algorithm 2 (MinReg Stage-Scheduler) The minimum total MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:
1. Use Algorithm 1 to compute the minimum integral MaxLive. We refer to the resulting stage schedule as the base solution. In case Algorithm 1 fails to produce a stage schedule, because of some critical recurrence cycles in the dependence graph, a new MRT-schedule must be generated, possibly with a different II.
2. If the underlying dependence graph is acyclic, the base solution results in the minimum MaxLive.
3. Otherwise, compute the total MaxLive associated with the base solution using Equation (16) and determine all sets of additional constraints, as defined in Equation (17), that may decrease the fractional MaxLive. Interesting sets to be evaluated consist of all possible forcing combinations of one attractive last-use for some or all virtual registers whose fractional lifetime can be decreased.
4. Use Algorithm 1 repeatedly to compute the minimum integral MaxLive for the base system augmented by each of the sets of additional constraints, in turn. The schedule among these that achieves the smallest total MaxLive is optimum.
We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second for acyclic underlying graphs.
Theorem 3 Algorithm 2 produces the minimum MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. This solution is guaranteed to be integer valued.
Proof: Since we test all the combinations of last-use operations that potentially reduce the fractional MaxLive, and since Algorithm 1 produces the minimum integral MaxLive for the augmented constraint set, the solutions of Algorithm 2 and the original unaugmented solution must include the minimum achievable total MaxLive for this MRT. It is easy to see that the additional set of constraints as defined in Equation (17) also satisfies the guaranteed integer solution property for LP problems. Therefore, the solution produced by an LP-solver for each use of Algorithm 1 is guaranteed to be integer valued. □
While there is potentially an exponential number of additional constraints to test, this number in practice tends to be relatively small for several reasons. First, the number of tests is directly dependent on the number of cycles in the underlying dependence graph, which is usually a small number. Second, we have to consider only those cases that potentially reduce the fractional MaxLive. We do not need to consider use operations unless they may impact the MRT row that could govern the second term of Equation (16). Finally, we can use some inclusion properties to reduce the total number of tests.
Theorem 4 For dependence graphs with acyclic underlying graphs, the solution that minimizes the integral MaxLive also minimizes the total MaxLive for a given kernel, MRT, and set of latencies.
Proof: In general, the solution that minimizes the integral MaxLive may not be a solution with the minimal fractional register requirements. Recall, however, that the Algorithm 1 solutions for acyclic underlying graphs have all p_{i,j} = 0. Therefore, reducing a fractional lifetime for some vr_i requires setting some p_{i,j} to 1, increasing the integer lifetime of vr_i by 1. Therefore, reducing the fractional register requirements of vr_i cannot decrease the MaxLive of this virtual register. As the lifetime of each vr is independent of the lifetime of any other vr for an acyclic underlying graph, it is never advantageous to force a different last-use operation unless the total lifetime of the vr is decreased. Thus, we cannot reduce the register requirements of the base solution. □
6 Measurements

In this section, we investigate the register requirements of the two stage-scheduling algorithms of this paper. For purposes of comparison, we also present the register requirements of an efficient modulo scheduling algorithm that presently does not attempt to minimize the register requirements. Unfortunately, we are unable to provide a comparison with Huff's scheduling algorithm [11] since his machine model differs slightly from ours and his latest scheduler, presented in [11], was not available to us.
MinReg Stage-Scheduler: This scheduler minimizes MaxLive over all valid modulo schedules that share a given MRT, using Algorithm 2, which was presented in Section 5. The resulting schedule has the lowest achievable register requirements for the given machine, loop iteration, and MRT.

MinBuf Stage-Scheduler: This scheduler minimizes integer MaxLive over all valid modulo schedules that share a given MRT, using Algorithm 1, which was presented in Section 4. The resulting schedule has the lowest achievable buffer requirements for the given machine, loop iteration, and MRT. Buffers must be reserved for a time interval that is a multiple of II, whereas registers may be reserved for arbitrary time periods. We derive and use in our comparisons the actual register requirements associated with these MinBuf stage schedules.

Iterative Modulo Scheduler [25]: This scheduler has been designed to deal efficiently with realistic machine models while producing schedules with near-optimal steady-state throughput. Experimental findings show that this algorithm requires the scheduling of only 59% more operations than does acyclic list scheduling. At the same time it results in schedules that are optimal in II for 96% of the loops in their benchmark [25]. In its current form, the scheduler does not attempt to minimize the register requirements of its schedules; however, the register requirements of its schedules may be reduced significantly by the simple heuristics presented in [15].

               Operations   Operations    Underlying   Register   Scheduling
                            in cycles     cycles       edges      edges
    minimum        2            0             0            1          1
    average       17.5         10.2          3.0         21.4       22.5
    maximum      161          158           66          220        232

Table 2: Characteristics of the benchmark suite.
[Figure: fraction of loops scheduled versus MaxLive (0-160), comparing the Schedule-Independent Lower Bound, the MinReg Stage Scheduler, and the Iterative Modulo Scheduler.]
Figure 8: Register requirements of the benchmark suite.

We use a benchmark of loops obtained from the Perfect Club [26], SPEC-89 [27], and the Livermore Fortran Kernels [28]. Our benchmark consists exclusively of innermost loops with no early exits, no procedure calls, and fewer than 30 basic blocks, as compiled by the Cydra 5 Fortran77 compiler [29]. The input to the three scheduling algorithms consists of the Fortran77 compiler intermediate representation after load-store elimination, recurrence back-substitution, and IF-conversion. Our benchmark suite consists of the 1327 loops successfully modulo scheduled by the Cydra 5 Fortran77 compiler. The MinReg Stage-Scheduler found an optimal solution for 1296 of the 1327 loops. For the remaining 31 loops, the scheduler exceeded the 512-augmented-problem limit (Algorithm 2, Step 4) used in our measurements. Table 2 summarizes the principal characteristics of the benchmark suite, reporting the minimum, average, and maximum value for each entry. On average, more than 37% of the operations in a loop iteration belong to an underlying cycle, and more than 62% belong to an underlying cycle or self-loop in the underlying dependence graph.

The machine model used in these experiments corresponds to the Cydra 5 machine. This choice was motivated by the availability of quality code for this machine. Also, the resource requirements of the Cydra 5 machine are complex [14], thus stressing the importance of good and robust scheduling algorithms. In particular, the machine configuration is the one used in [25], with 7 functional units (2 memory port units, 2 address generation units, 1 FP adder unit, 1 FP multiplier unit, and 1 branch unit). The Iterative Modulo Scheduler was also used to generate the MRT input for the MinReg and MinBuf Stage-Schedulers.
We first investigate the register requirements of the MinReg Stage Scheduler, which results in the lowest register requirements over all valid modulo schedules that share a common MRT. Figure 8 presents the fraction of the loops in the benchmark suite that can be scheduled for a machine with a given number of registers without spilling and without increasing II. In this graph, the X-axis represents MaxLive and the Y-axis represents the fraction of loops scheduled on a machine with up to MaxLive physical registers. The "Schedule-Independent Lower Bound" curve corresponds to the bound presented in [11] and is representative of the register requirements of a machine with unlimited resources. There is a significant gap between the MinReg Stage Scheduler curve and the Lower Bound curve, which we believe is caused by two factors. First, the MinReg Stage Scheduler searches for a schedule with the lowest register requirements only among the modulo schedules that share the given MRT. Therefore, when the MRT given to the stage scheduler is suboptimal, the stage scheduler will find a local minimum that may be significantly larger than
Figure 9: Additional register requirements over MinReg Stage-Scheduler. the absolute minimum.2 The second factor is that the schedule-independent lower bound may be signi cantly too optimistic due to the complexity of some of the loops in the benchmark suite and the fact that this lower bound ignores the resource constraints of the machine. The lowest curve presents the register requirements of the Iterative Modulo Scheduler, which presently does not attempt to minimize the register requirements. We see that stage scheduling for minimum register requirements signi cantly reduces the register requirements of the loops in the benchmark suite considered. In general, stage scheduling will make a signi cant dierence for schedules in which the number of stages and the number of operations not on the critical path are not too small. Trying dierent MRTs for a given loop iteration is likely to become increasingly important as II increases. The gap between these last two curves also indicates the degree of improvement that might be achieved by stage scheduling heuristics. average maximum standard deviation
MinBuf Stage Scheduler MinReg Stage Scheduler 112.2 ms 23.5 sec 17.6 sec 108.3 min 816.6 ms 268.4 sec
Table 3: Computation time on a SPARC-20. Figure 9 illustrates the number of additional registers needed when the other scheduling algorithms are used instead of the MinReg Stage-Scheduler. The upper curve corresponds to the additional register requirements of the MinBuf Stage-Scheduler, which nds a stage-schedule with no additional register requirements in 94% of the loops. The MinBuf Stage-Scheduler schedules 98% of the loops with no more than 1 additional register but needs up to 7 additional registers for the remaining 2%. The Iterative Modulo Scheduler results in a stage-schedule with no additional register requirements in 42% of the loops, a surprisingly high number for a scheduler that does not attempt to minimize the register requirements. This result may be partially Experimental evidence on a subset of this benchmark suite, consisting of all the loops with no more than 12 operations and up to 5, for which an optimal modulo schedule was sought over all MRTs, showed, however, that the dierence between the local and absolute minimum was surprisingly small[20]. This result may not hold for larger loops. 2
II
18
explained by the fact that this scheduler does attempt to minimize the length of a schedule, which generally results in a stage schedule with low register requirements along the critical path of that schedule. The computation times for the MinReg and MinBuf stage schedulers for the average and the maximum loop, and the standard deviation, are given in Table 3.
7 Conclusions

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops. It results in high performance code but increases the register requirements. As the trend toward higher concurrency continues, whether due to machines with faster clocks and deeper pipelines, wider issue, or a combination of both, the register requirements will increase even more. As a result, scheduling algorithms that reduce register pressure while scheduling for high throughput are increasingly important.

This paper presents an approach that schedules the operations of a loop iteration to achieve the minimum register requirements for a given modulo reservation table. This method finds a stage schedule with minimum register requirements for general dependence graphs on machines with finite resources. When known to be optimal, a linear-time algorithm is used; otherwise, a linear programming approach is used. We can also quickly determine when the faster algorithm is applicable. This paper demonstrates by example that selecting a good stage schedule among all schedules that share the same MRT can result in lower register requirements. Measurements on a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that the register requirements decrease by 24.5% on average when applying the optimal stage scheduler to the MRT-schedules of a register-insensitive modulo scheduler. Though the general algorithm may be too slow for general use in a production compiler, it may be extremely useful in special situations. This algorithm is also useful in evaluating the performance of register-sensitive modulo scheduling heuristics.
Acknowledgment

This work was supported in part by the Office of Naval Research under grant number N00014-93-1-0163 and by Hewlett-Packard. The authors would like to thank B. Ramakrishna Rau for his many insights, useful suggestions, and for providing the input data set.
References
[1] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. In The Journal of Supercomputing, volume 7, pages 9-50, 1993.
[2] P. Y. Hsu. Highly Concurrent Scalar Processing. PhD thesis, University of Illinois at Urbana-Champaign, 1986.
[3] B. R. Rau, C. D. Glaeser, and R. L. Picard. Efficient code generation for horizontal architectures: Compiler techniques and architecture support. Proceedings of the Ninth Annual International Symposium on Computer Architecture, pages 131-139, 1982.
[4] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. Proceedings of the ACM SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.
[5] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170-179, December 1992.
[6] N. J. Warter. Modulo Scheduling with Isomorphic Control Transformations. PhD thesis, University of Illinois at Urbana-Champaign, 1994.
[7] P. P. Tirumalai, M. Lee, and M. S. Schlansker. Parallelization of loops with exits on pipelined architectures. Proceedings of Supercomputing '90, pages 200-212, November 1990.
[8] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. Proceedings of the ACM SIGPLAN'92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992.
[9] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson. Register requirements of pipelined processors. Proceedings of the International Conference on Supercomputing, pages 260-271, July 1992.
[10] C. Eisenbeis and D. Windheiser. Optimal software pipelining in presence of resource constraints. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, August 1993.
[11] R. A. Huff. Lifetime-sensitive modulo scheduling. Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pages 258-267, June 1993.
[12] J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. Proceedings of the Third Annual International Symposium on Computer Architecture, pages 159-164, 1976.
[13] C. Eisenbeis, W. Jalby, and A. Lichnewsky. Squeezing more performance out of a Cray-2 by vector block scheduling. Proceedings of Supercomputing '88, pages 237-246, November 1988.
[14] G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: Architecture and implementation. In The Journal of Supercomputing, volume 7, pages 143-180, 1993.
[15] A. E. Eichenberger and E. S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180-191, November 1995.
[16] Q. Ning and G. R. Gao. A novel framework of register allocation for software pipelining. Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29-42, 1993.
[17] R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85-94, November 1994.
[18] J. Wang, A. Krall, and M. A. Ertl. Decomposed software pipelining with reduced register requirement. In Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, June 1995.
[19] Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, 1994.
[20] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Optimum modulo schedules for minimum register requirements. Proceedings of the International Conference on Supercomputing, pages 31-40, July 1995.
[21] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez. Hypernode reduction modulo scheduling. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 350-360, November 1995.
[22] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley, New York, 1988.
[23] K. Paton. An algorithm for finding a fundamental set of cycles of a graph. Communications of the ACM, 12(9):514-518, September 1969.
[24] S. Chaudhuri, R. A. Walker, and J. E. Mitchell. Analysing and exploiting the structure of the constraints in the ILP approach to the scheduling problem. IEEE Transactions on Very Large Scale Integration Systems, 2(4):456-471, December 1994.
[25] B. R. Rau. Iterative Modulo Scheduling: An algorithm for software pipelining loops. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November 1994.
[26] M. Berry et al. The Perfect Club Benchmarks: Effective performance evaluation of supercomputers. The International Journal of Supercomputer Applications, 3(3):5-40, Fall 1989.
[27] J. Uniejewski. SPEC Benchmark Suite: Designed for today's advanced system. SPEC Newsletter, Fall 1989.
[28] F. H. McMahon. The Livermore Fortran Kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, California, 1986.
[29] J. C. Dehnert and R. A. Towle. Compiling for the Cydra 5. In The Journal of Supercomputing, volume 7, pages 181-227, 1993.