Compiler Controlled Register Stack Management for the Intel Itanium Architecture

Alex Settle, Daniel A. Connors
Department of Electrical and Computer Engineering
University of Colorado at Boulder
{settle, dconnors}@colorado.edu

Gerolf Hoflehner, Dan Lavery
Intel Corporation, Santa Clara, CA
{gerolf.f.hoflehner,daniel.m.lavery}@intel.com

January 20, 2004

Abstract

Intel Itanium processors were designed with an on-chip register stack engine (RSE) in order to reduce the overhead related to procedure call boundaries. The RSE automatically preserves values stored in stacked registers across procedure invocations. This architecture model significantly reduces the amount of spill code necessary to maintain an application's state, which in turn reduces memory traffic. Despite the benefits provided by the RSE, CPU stalls due to register stack overflow and underflow can contribute significantly to the overall execution time of many applications. This paper presents a method for reducing these stalls by allowing the compiler to manage the size of the stacked register set assigned to a given procedure. The proposed compiler optimization was integrated into version 7.1 of the Intel Itanium compiler and tested on the SpecCInt2000 benchmark suite. The results indicate that allowing the compiler to manage RSE usage produces a 10% drop in processor stall cycles related to RSE traffic.

1 Introduction

The Intel Itanium processor family (IPF), which uses an in-order microarchitecture, has proved successful for enterprise and high performance computing. The design philosophy of removing complexity from the hardware and allowing the compiler to manage the microarchitecture has enabled more of the transistor budget to be applied to building larger on-chip caches. Since the processor relies on in-order compiler scheduling, the penalty for memory delays can play a more significant role than in comparable superscalar processor architectures. In fact, studies have shown that for some applications as much as 50% of the total execution time is spent servicing cache misses [2]. This paper introduces a compiler directed approach to reducing system memory latency by targeting the latency associated with RSE spills and fills.

The contribution of this work is two-fold. First, it offers a static compiler optimization that reduces the overall memory penalty for applications run on IPF processors. Second, it provides a detailed analysis of the run-time behavior of these applications by studying the on-chip performance counters.

1.1 RSE Background

The Itanium architecture has 128 architected integer registers, r0-r127. Registers r0-r31 are static, while the upper 96 registers, r32-r127, are stacked and dynamically mapped to a larger physical register file. Each procedure has an associated stack frame, which can house up to 96 stacked registers. The size of the stack frame, and in turn the number of stacked registers, is set at compile time by the alloc instruction. The register stack (RS) is a large register file that is used to preserve the contents of the registers in a stack frame across function call boundaries. Hardware controls the mapping (Figure 1) between the virtual register names of the stack frame, r32-r127, and the physical register file. Settle et al. [11] provide a detailed description of the RSE.

Figure 1: Register Mapping Example (the in, local, and out registers of the stack frames for foo(), r32-r40, and bar(), r32-r36, are mapped onto the physical register file)

The RSE reduces spill and fill operations in two ways. First, because the integer register file is large, spill code is needed only when a function requires more than 96 stacked registers. Second, in a traditional architecture, all of the registers used by a procedure must be saved to memory on function entry and restored on exit; registers are saved to memory even if they are not used by the caller. As an example, if a function uses 20 registers, then on entry to the function there are 20 store instructions. With the RSE, no store instructions are required because these registers are automatically preserved on the register stack.

Although the RSE reduces the number of memory accesses due to spill and fill code, it can still contribute to program latency when the register stack overflows or underflows. When this occurs, two issues affect performance. The first is that the processor is stalled until the RSE spill or fill has completed. The second is that for a spill or fill to complete, the target memory address must be brought into the highest cache level if it is not already there; in this case it displaces another cache line, which in turn could lead to future cache misses.

1.2 Optimization Background

Compilers typically use the alloc instruction to set the stack frame size on procedure entry; however, the instruction set architecture (ISA) allows this instruction to reset the stack frame size at any point in the procedure. This paper uses compile-time analysis during the register allocation phase to assist stack frame reduction in a later optimization phase. The register allocator assigns low numbered registers to variables that have long live ranges. This leads to a compact stack frame, in that low numbered registers are likely to remain live for the duration of a procedure's execution, leaving higher numbered registers holding values with short lifetimes and thus making it more likely that these registers will be dead at a given point in the control flow. For a stack frame to be compact, there should be no internal fragmentation in the stack frame: there must be a stacked register rx such that all stacked registers numbered below it are live and all those above it are dead. The concept of register pressure is introduced to explain how the compiler groups basic blocks into regions whose associated stack frames can be resized by the alloc instruction. Register pressure describes the number of stacked registers that are live through a basic block. This information, combined with knowledge of the largest stacked register used by the block, allows the compiler to determine specific points in the control flow graph where alloc instructions can be inserted.
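The compactness property can be phrased as a simple predicate over the set of stacked registers live at a point. The following sketch (helper names and the set representation are illustrative, not taken from the Intel compiler) returns the boundary register rx when the frame is fragmentation-free:

```python
# Stacked registers r32-r127 are represented by their register numbers.
FIRST_STACKED = 32

def compact_boundary(live):
    """Return the boundary register number rx such that every stacked
    register below rx is live and every register above it is dead, or
    None if the live set is internally fragmented."""
    if not live:
        return FIRST_STACKED  # empty frame: it can shrink to nothing
    highest = max(live)
    # Compact means live == {r32, ..., highest} with no gaps.
    if live == set(range(FIRST_STACKED, highest + 1)):
        return highest + 1  # first dead register above the live span
    return None  # a dead register sits below a live one: fragmentation

# r32-r35 live: compact, the frame can be cut just above r35.
assert compact_boundary({32, 33, 34, 35}) == 36
# r34 dead below the live r35: fragmented, the frame cannot be cut.
assert compact_boundary({32, 33, 35}) is None
```

When `compact_boundary` returns a register number, everything at or above it can be dropped from the frame by an alloc; when it returns None, the fragmentation described above blocks the reduction.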

2 Related Work

There has been significant related work in the area of register allocation, but the RSE differs in that it provides autonomous fine-grained mapping of virtual to physical registers. The SPARC architecture uses register windows, which are a similar concept to the RSE. Register windows allow function parameters to be passed through designated registers and also provide a small set of registers for storage local to a procedure [12]. The interface to these registers is less configurable than for the Itanium stacked registers. In particular, the register window size is fixed at twenty-four registers, while an Itanium register stack frame can be manipulated at run-time by an alloc instruction. Both the SPARC and the Itanium overlap the outgoing parameters of the caller with the incoming parameters of the callee [8].

Since the register stack is analogous to the procedure call stack, techniques for reducing the overhead of writing the procedure stack can be mapped to the RSE problem. The concept of shrink wrapping introduced by [4] directs the compiler to preserve callee-saved registers on the control flow paths that use the given registers. In this model, different execution paths through a function lead to a variation in the number of callee spill and fill operations. In the register stack frame domain, the problem is slightly different: stacked registers cannot be accounted for individually. Even though a register may not be live across a code section, it may still have to be contained in the stack frame if there are higher numbered stacked registers that are live across the section. Thus, the effectiveness of the alloc optimization depends upon the way in which virtual registers are mapped to their physical register names.

Other studies have addressed the issue of increasing the register file size. One such study [10] proposes a caching mechanism used to access the set of registers in a larger register file. The important problem to address is that with a larger register file, access times increase, making it more difficult to reference the register file in a single clock cycle. Providing a caching scheme or some other fast look-up procedure may make it possible to reap the rewards of a large register file without suffering a delay in access time [9].

Another compiler directed RSE optimization offers a brute force method of issuing the alloc instructions [5]. This approach analyzes stacked register usage per basic block, and issues the alloc instruction if the stack frame can be reduced for the basic block. The allocation algorithm presented here differs in that it takes a more global view of the control flow graph and issues alloc instructions at strategic locations. Previous studies on the RSE [13] have proposed modifications to the RSE itself to provide a more efficient means of stacked register allocation. Inter Procedural Stacked Register Allocation for Itanium Like Architecture [14] is a profile based approach to reducing the impact of RSE latency. This work uses a profile guided view of the control flow in order to weigh the cost of issuing alloc instructions against the cost of issuing spill code. In cold code regions, issuing spill code should be relatively safe while freeing up registers allocated to a given stack frame. Settle et al. introduced the idea of issuing alloc instructions before function calls in order to limit the number of stacked registers saved in the register stack [11]. This work revealed the potential for performance improvements from compiler controlled stack frame management; however, the results indicated that the alloc insertion was limited by internal fragmentation of the register stack frame.

3 Motivation

Figure 2: RSE Latency Relative to Execution Time (percentage of execution time spent servicing RSE traffic, base case, for the SpecCInt2000 benchmarks)

Although the RSE improves system performance by reducing the memory latency associated with procedure calls, programs with large data sets can suffer a significant penalty caused by RSE overflow. In the SpecCInt2000 benchmark suite several benchmarks have a significant RSE penalty. This paper focuses on improving application performance by reducing the contribution of the RSE to processor stall time. Figure 2 shows the percentage of program execution time that is spent servicing RSE spills and fills for the SpecCInt2000 benchmarks. The benchmarks were compiled at optimization level 2 and were run on an Itanium 2 processor with 3MB of level 3 cache. For many of the benchmarks there is little noticeable impact; however, on the larger applications such as 176.gcc, 186.crafty, 253.perlbmk, 255.vortex, and 252.eon, the RSE accounts for between 2% and 4% of the total processor stall cycles.

Mitigating the negative effects of the RSE can be handled either in software or in hardware. In the case of hardware, increasing the size of the physical register file would limit the number of RSE spills and fills; however, design considerations such as single cycle access time of the register file and the transistor budget must be taken into account [13]. Even with a hardware solution, there is still room for the compiler to manage RSE behavior, which is consistent with the IA-64 design philosophy. The flow graph of Figure 3 illustrates the property of stacked register pressure between neighboring basic blocks. In this example, the darker colored blocks indicate that a high number of stacked registers are live across them. The group of basic blocks starting with block 240 and ending with block 906 requires only a fraction of the original set of stacked registers in order to execute correctly. Thus, the stack frame could be reduced on entry into block 240 and restored on exit from block 906. If a call instruction is contained in this region, then the number of registers saved to the physical register file at procedure entry can be reduced, which could in turn limit the number of RSE overflow and underflow delays.

Figure 3: Register Pressure Motivation

4 Approach

In order for the compiler to produce code that can improve RSE performance, two stages in the compilation process must interact. First, the register coloring phase must provide a compact stack frame. Once the stacked registers have been sorted, the alloc optimization can adjust the stack frame size. By breaking the control flow graph into regions based upon the stacked register pressure, alloc instructions can be issued to adjust the frame size to fit the register demands of a given region. Although these two optimization stages are independent, they are mutually beneficial.

4.1 Register Live Range Sorting

The register allocator in the Intel Itanium compiler uses a region-based graph coloring algorithm [1][3]. In the context of register allocation, a region refers to either a loop or the collection of instructions in a function that do not belong to a loop. As an example, if a function were simply composed of a doubly nested loop, it would be partitioned into three regions: one for the inner loop, one for the outer loop, and one for any remaining instructions not contained in either loop. Figure 4 shows the four phases of the register allocator: interference graph construction, simplification, coloring, and region reconciliation. The graph construction phase builds the interference, or conflict, graph for a region. Next, the simplification phase reduces the graph to a stack of live ranges using node reduction. The coloring phase assigns a color, or hardware register, to each node on the stack, starting at the top of the stack. The final phase inserts instructions that resolve any nodes that overlap two or more regions. This is necessary, for example, when the

same node is assigned two different colors in two neighboring regions. The simplification phase removes a node from the interference graph when the number of conflicts for the node is lower than the number of colors available, and puts it on the node stack. If a node has more conflicts than there are colors available, spill code must be generated to accommodate the node; in this case the node is removed from both the interference graph and the node stack. The simplification phase can be modified so that live ranges which contain one or more calls are assigned lower numbered stacked registers. This technique is referred to as stacked register sorting in the remainder of the paper. Stacked register sorting can be further refined by taking into account parameters such as frequency weights, live range lengths, and the number of calls in a live range.

Figure 4: Region-based Register Allocator (interference graph construction → simplify graph → coloring → region reconciliation)
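The numbering effect of stacked register sorting can be sketched as follows. The data structures and the ordering key are illustrative; the production simplification phase operates on the interference graph rather than a flat list, but the idea is the same: live ranges that span calls are colored first and therefore receive the lowest numbered stacked registers.

```python
from dataclasses import dataclass

@dataclass
class LiveRange:
    name: str
    num_calls: int  # number of call sites the live range spans

def sort_for_coloring(ranges):
    # Ranges containing calls come first so they get low register numbers;
    # a refinement could also weigh frequency or live range length.
    return sorted(ranges, key=lambda lr: (lr.num_calls == 0, -lr.num_calls))

def assign_stacked_registers(ranges, first_reg=32):
    """Greedy numbering that ignores interference, purely to show the
    effect of the sort on the register numbers handed out."""
    order = sort_for_coloring(ranges)
    return {lr.name: first_reg + i for i, lr in enumerate(order)}

regs = assign_stacked_registers([LiveRange("a", 2), LiveRange("b", 0), LiveRange("c", 1)])
# "a" spans the most calls, so it lands in the lowest register r32;
# the call-free range "b" is pushed to the highest register.
assert regs == {"a": 32, "c": 33, "b": 34}
```

Because the call-free range ends up highest, it is the one most likely to be dead at a given control flow point, which is exactly the property the alloc optimization exploits.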

4.2 Region Based Alloc Insertion

Analysis of the assembly code generated for the SpecCInt2000 benchmarks led to the idea of partitioning the control flow graph into regions of basic blocks with similar stacked register pressure. Stacked register pressure refers to the percentage of a function's stacked registers that are live across a basic block. A basic block with x live stacked registers, for example, must have a stack frame large enough to accommodate those registers during execution of the block. Flow graphs similar to Figure 3 were built for the most frequently executed functions in the SpecCInt2000 benchmarks. These graphs led to the realization that groups of basic blocks can execute with a stack frame smaller than the frame size defined on entry into the function. This observation suggests that by monitoring the lifetimes of stacked registers, compilers should be able to compact the stack frame across groups of basic blocks. By grouping basic blocks into regions based upon their associated stacked register pressure, alloc instructions

that define a sufficient stack frame size can be inserted at the entry blocks of a region, so that all blocks in the region have a large enough stack frame to execute correctly. Although this discussion focuses on using stacked register pressure to define regions, other criteria can be used. As an example, in order to reproduce the experiments in [11], the definition of a region boundary was changed to be the transition from a call block to a non-call block and vice versa. Similarly, the region definition can combine block execution frequency with register pressure.
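Using the pressure definition given in section 4.3.1 (the function and block register maxima below are hypothetical values), separating region candidates from full-frame blocks can be sketched roughly as:

```python
FIRST_STACKED = 32

def register_pressure(max_block_reg, max_func_reg):
    """Reg_Press(B): percentage of the function's stacked registers that
    must stay in the frame while block B executes (section 4.3.1)."""
    return 100.0 * (max_block_reg - FIRST_STACKED + 1) / (max_func_reg - FIRST_STACKED + 1)

def partition_blocks(block_max_reg, max_func_reg, threshold):
    """Blocks whose frame can shrink by at least `threshold` percent are
    region candidates; the rest keep the full frame.  This encodes one
    consistent reading of the paper's threshold: pressure <= 100 - threshold."""
    reducible, full = [], []
    for block, max_reg in block_max_reg.items():
        if register_pressure(max_reg, max_func_reg) <= 100 - threshold:
            reducible.append(block)
        else:
            full.append(block)
    return reducible, full

# Function using up to r60; block B1 touches only up to r40 (pressure ~31%),
# block B2 needs the whole frame (pressure 100%).
reducible, full = partition_blocks({"B1": 40, "B2": 60}, 60, threshold=10)
assert reducible == ["B1"] and full == ["B2"]
```

Adjacent blocks that land in the `reducible` bucket are the raw material for the region formation pass described next.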

4.3 Algorithm Description

4.3.1 Region Formation

Before dividing the control flow graph into regions, the register pressure for each basic block must be determined. The register pressure of a block is defined to be the percentage of a function's stacked local registers that are live across the block. To calculate the stacked register pressure, the union of the live-in and def sets, Reg_Set(B), for each basic block is recorded. For basic block B, let Max_Block_Reg(B) be the highest numbered physical stacked register in Reg_Set(B), and let Max_Func_Reg(F) be the highest numbered physical stacked register in function F. Then Reg_Press(B) is defined as:

Reg_Press(B) = 100 * (Max_Block_Reg(B) - 32 + 1) / (Max_Func_Reg(F) - 32 + 1)

The 32 represents the first stacked register, r32. For example, if the highest numbered stacked local register used in function foo() is r60 and the highest in basic block B is r40, then the register pressure at block B is (40 - 32 + 1)/(60 - 32 + 1) = 31%.

In order to group basic blocks based on their register pressure, a threshold value is set at compile time. Basic blocks whose register pressure falls below the threshold are grouped into regions which can have a reduced stack frame, while the remainder require the original set of stacked registers. Thus, if the threshold were set at 10%, then basic blocks that can execute with a stack frame reduced by 10% or more could be put in the same region, while the remaining basic blocks would require the full-sized stack frame. After the register pressure for each block has been

recorded, the next phase is to locate all of the control flow edges that represent transitions between regions. A transition occurs when an edge flows from a block of one region to a block of another. Starting from the targets of each transition edge, the successor blocks are searched in depth-first order until either an exit block or another transition edge is found. The basic blocks covered in this search comprise the region. To complete the region formation phase, all of the entry and exit blocks must be identified and labeled.

4.3.2 Parameter Mapping

Once regions have been formed, the next step is to determine whether any of the region entry blocks have live-in procedure argument registers. If so, the register must be renamed to accommodate a change in the register stack frame size. Figure 1 shows the layout of a register stack frame: the registers are ordered starting with incoming parameters, followed by locals, then outgoing. If the size of the stack frame is reduced, then the position of the largest local stacked register moves to the left. The problem here is that a function expects its first argument to be in the register adjacent to the highest local. In order to generate correct code, the value in the outgoing register must be moved to the register adjacent to the highest local stacked register before an alloc instruction can be issued. Figure 5 illustrates this problem. Furthermore, if a region entry block has two or more predecessors from regions with different register pressures, the mapping problem increases in difficulty; in this case, the regions are merged and the above parameter register tests are performed on the resulting region.

In the example of Figure 5, r50 is live on entry into the call block for function foo(). The compiler records this information and uses it to determine the destination and source registers of the associated mov instruction. The instruction mov r40=r50 is issued before the alloc in order to ensure that the register will remain in the active stack frame. If the mov had instead been inserted after the alloc, register r50 would be out of scope and the value contained in it would not be correct. In this example the stack frame size is decreasing from 30 to 20 registers, so the mov instruction used to copy the parameter register must be placed before the alloc instruction. If the frame size were increasing, however, then the mov would need to be placed after the alloc instruction. This ensures that both the source and destination register values are preserved across the stack frame resizing operation.

Figure 5: Parameter Register Mapping (before the resizing alloc, the first outgoing argument is copied from r50 to r40 with mov r40=r50 ahead of the call to foo())

4.3.3 Alloc Placement

Once regions have been formed, alloc instructions are issued at each region entry block to ensure that the stack frame is sized appropriately. Placing the alloc instructions at each region entry block guarantees that all paths through the region will have the correct register stack frame. For function entry blocks, the existing alloc instruction is reset according to the frame size of the entry region. In addition to the alloc and mov instructions discussed in section 4.3.2, nop instructions may need to be issued due to the ISA requirement that an alloc must be the first instruction in an instruction group [6]. Thus, on entry into each region, some combination of alloc, mov, and nop instructions must be issued in order to maintain correct code. This overhead should be taken into consideration when determining the metrics for region formation (Figure 9).
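The placement rules of sections 4.3.2 and 4.3.3 can be sketched together. Instructions are modeled as plain strings and the alloc is schematic (the real instruction names its frame layout explicitly); the ';;' stop bit stands in for the requirement that the alloc begin a new instruction group.

```python
def region_entry_code(old_size, new_size, param_moves):
    """Build the instruction sequence for a region entry block.
    param_moves is a list of (src, dst) copies for outgoing-argument
    registers.  When the frame shrinks, the movs must precede the alloc
    so their source registers are still in scope; when it grows, they
    must follow it so their destination registers are in scope."""
    movs = [f"mov {dst}={src}" for src, dst in param_moves]
    alloc = [";;", f"alloc (resize frame {old_size}->{new_size})"]
    return movs + alloc if new_size < old_size else alloc + movs

# Shrinking from 30 to 20 registers: copy the first out-arg r50 into r40
# before the alloc cuts the frame (mirrors Figure 5).
code = region_entry_code(30, 20, [("r50", "r40")])
assert code == ["mov r40=r50", ";;", "alloc (resize frame 30->20)"]
```

A growing frame reverses the order, e.g. `region_entry_code(20, 30, [("r40", "r50")])` emits the alloc first and the mov afterwards, so the destination r50 exists when the copy runs.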

5 Results

The allocation optimization was integrated into version 7.1 of the Intel Itanium compiler. The experiments were run on an Itanium 2 processor with 3MB of L3 cache under the Linux operating system. The optimization was activated along with the standard level 2 optimizations, which encompass an aggressive set of low level optimizations including control and data speculation and software pipelining. The SpecCInt2000 benchmarks were compiled using a range of allocation optimization flags. First, the performance of the stacked register sorting algorithms is compared against the base O2 optimizations. Next, different combinations of the region based allocation algorithm are tested with the register sorting algorithm. Then the region based allocation algorithm is set to model the work presented in [11], combined with the register sorting phase. Finally, the side effects of the allocation algorithm are investigated. For this study, the SpecCInt2000 benchmarks were run to completion using the ref input set, and performance counter statistics were recorded by the pfmon [7] performance monitoring tool.

5.1 Bar Graph Legend

The bar graphs in this section present the results from benchmarks that were built using the settings discussed in section 5. This section explains the configuration flags and the associated encoding used in the graph legends. Benchmarks built with the default optimization level are marked as "base" in the bar graphs. Two stacked register sorting algorithms are tested and have the names "base sort2" and "base sort6". Next, the region based allocation compiler flags encode both the threshold value for starting a new region and the accompanying register sorting algorithm. For example, the name r10s2 means that a block is part of a region if the stacked register pressure of the block is less than 10% of the maximum register pressure for the function, and that register sorting option 2 is used. This policy is described in more detail in section 4.3.1. Lastly, the name "call xsy" indicates that call blocks that can be reduced by at least "x" registers are region entry points; the "sy" field refers to the associated register sorting algorithm.

5.2 Register Sorting Performance

The register sorting algorithm, as described in section 4.1, is used to assist the allocation optimization. Ordering the registers so that the higher numbered registers are likely to be dead at a given point in the control flow gives the allocation algorithm better opportunities for reducing the stack frame size. The register sorting phase is run during register allocation and should not insert additional instructions. However, this stage can lead to a change in the number of stacked registers assigned to a function. If this occurs in a function that stresses the RSE, it can either increase or decrease the amount of RSE memory transactions, depending on the direction of the change in stack frame size.

Figure 6: Register Sorting Effect on RSE (normalized RSE stall cycles; legend: base, base sort 2, base sort 6)

Figure 7: Processor Stall Time (normalized processor stall cycles; legend: base, base sort 2, r5s2)

The graph of Figure 6 shows this effect; it represents the effect that the register sorting algorithms have on processor stall time. Across the benchmarks the impact on RSE stall time varies, with some benchmarks

such as 186.crafty showing a net reduction of 10% and others, such as 181.mcf, showing a net gain of nearly 20%. The data of Figure 6 suggest that the impact of this optimization on the RSE is unpredictable. Simply allocating one fewer stacked register to a frequently executed function that is part of a deep call chain could significantly reduce the net RSE traffic; conversely, one additional stacked register in the same function could increase RSE traffic substantially. Thus, the purpose of this optimization stage is to organize the stacked registers to facilitate the task of changing the stack frame size at specific points in the control flow.

Figure 8: Region Formation RSE Latency (normalized RSE latency; legend: r5s2, r10s2)

Figure 9: Dynamic Instruction Count (normalized dynamic instruction count relative to base; legend: base sort 2, r5s2)

5.3 Region Based Allocation Optimization

The region based allocation algorithm was tested using a range of control thresholds in order to identify the best performing cases. In this model, a region represents a group of basic blocks whose register pressure is less than a specified threshold. A threshold of 10%, for example, partitions the blocks into those that use fewer than 90% of the total number of registers assigned to the function and those that use 90% or more. Changing the threshold leads to an associated change in the number of alloc instructions issued, which can affect performance in two ways. First, if the alloc instructions are issued in high frequency regions where there is a significant reduction in stack frame size, this will likely lead to a net program speedup. If, on the other hand, alloc instructions are issued too liberally, this can lead to an increase in dynamic instruction count, which could in turn lead to a net slowdown.

5.3.1 Region Based Performance

According to the graph of Figure 8, the different region thresholds provide similar results, with a few important exceptions. The r5s2 case reduces the RSE latency by 5% and 10% respectively for the benchmarks 186.crafty and 253.perlbmk, while the r10s2 setting reduces the RSE latency by 40% for 300.twolf. This result suggests that there are critical locations in the control flow where the addition of an alloc instruction can significantly alter the RSE overflow and underflow activity.

Figure 10 shows the relative performance gains due to the region based allocation algorithm. The benchmarks exhibit a 0.5% to 2% overall speedup for most cases, with the notable exceptions of 253.perlbmk and 255.vortex, which show slowdowns of 1.5% and 0.5% respectively. Since there is not a considerable increase in the dynamic instruction count for 253.perlbmk (Figure 9), the slowdown is likely caused by the way in which the additional alloc instructions are scheduled. Because alloc instructions must be the first instruction in an instruction group, code with a high degree of ILP before the alloc instructions are inserted could become more difficult to schedule, resulting in an increase in stop bits and possibly a schedule with a lower degree of parallelism.

The benchmark 186.crafty, on the other hand, exhibits a speedup of 2% when built with the r5s2 flags. For this case it shows a reduction in RSE stall cycles of 20% (Figure 8). In the base configuration, RSE stall cycles account for 4% of the total processor stall time (Figure 2); thus, a reduction in RSE stalls of 20% should lead to a net speedup of at least 1%. In addition to limiting the number of stall cycles associated with RSE latency, there was an associated improvement in memory system performance, as indicated by the 5% drop in memory stall cycles (Figure 7). The combination of reductions in RSE latency and memory latency leads to a net speedup of 2% (Figure 10).

Figure 10: Alloc Algorithm Speedup (normalized cycle count; legend: r5s2, call4 s2)

Figure 11: RSE Stall Time Comparison (normalized RSE stall cycles; legend: base, base sort 2, r5s2, call4 s2)

5.4 Performance Analysis

The results indicate that reducing RSE traffic can improve an application's performance, but the cost associated with this optimization must be carefully weighed against its benefit. There appear to be two significant side effects of resizing a function's stack frame using software techniques. The first is the potential increase in the dynamic instruction count of a program, specifically if an alloc is inserted in the body of a loop with a high trip count. The second is the potential to complicate the software schedule by adding instructions that have special requirements, such as being the first instruction in a group.

5.5 Call Block Regions

The definition of a region can be changed by setting a compiler flag so that a call block defines the boundary between two regions. In this mode the allocation algorithm is nearly the same as the allocation algorithm in [11]. For this case, the register stack frame is reduced before a call block and restored in the following block, so that fewer registers need to be saved in the stacked register file at call boundaries. According to Figure 11, the call block region combined with stacked register sorting produces a similar effect on the RSE stall cycles as the region based approach. It is interesting to note, however, that some benchmarks such as 300.twolf reduce the RSE stall cycles significantly more than under the region based approach (35%), while in other cases such as 253.perlbmk (10%), the region based approach leads to a larger stall cycle reduction.

6 Conclusion

This paper studies a set of alloc insertion optimizations designed to improve system performance by reducing the stall time associated with RSE spills and fills. Based upon the experiments, the region based allocation optimization produced the best performance gains; specifically, it lowered the RSE stall cycles across all of the benchmarks by an average of 10%. The experiments also indicate that the allocation algorithm is sensitive to the additional instructions required to control the stack frame size. Future work should therefore be directed toward limiting the number of alloc instructions issued while maintaining the necessary fine grain control over the register stack frame size.

The region based allocation algorithm can be extended by building regions based upon basic block execution frequency. With a small change to the front end of the optimization, it can be tuned to use profiling information to direct region formation. An ideal region would consist of frequently executed blocks that require only a small subset of the allocated stacked registers. It would also be interesting to test this optimization on a set of applications that stress the RSE to a greater degree than the SpecCInt2000 benchmarks. Reducing the spills and fills associated with the RSE yields performance gains for benchmarks where the RSE contributes significantly to the overall stall time. Because of the cost of the additional instructions, this optimization should be applied only to applications where the RSE latency is significant; otherwise, the additional alloc instructions increase the processor workload without contributing to latency reduction. In benchmarks where the RSE does contribute to slowdown, the allocation optimization can improve performance by several percentage points: first, by limiting the number of RSE stall cycles, and second, by reducing the probability of the cache misses that occur after an RSE spill or fill address is brought into the cache. In addition, the allocation optimization provides a predictable method for reducing the RSE overhead of a given application. Thus, compiler controlled run-time RSE management is an extensible and effective method for mitigating the latency associated with the IPF register stack engine.

References

[1] J. Bharadwaj, W. Y. Chen, W. Chuang, G. Hoflehner, K. Menezes, K. Muthukumar, and J. Pierce. The Intel IA-64 compiler code generator. IEEE Micro, 20(5):44–52, September/October 2000.

[2] I. Bratt, A. Settle, and D. A. Connors. Predicate-based transformations to eliminate control and data-irrelevant cache misses. In Proceedings of the First Workshop on Explicitly Parallel Instruction Computing Architectures and Compiler Techniques, pages 11–22, December 2001.

[3] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 98–105, June 1982.

[4] F. C. Chow. Minimizing register usage penalty at procedure calls. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 85–94, Atlanta, GA, June 1988.

[5] A. Douillet, J. N. Amaral, and G. R. Gao. Fine-grain stacked register allocation for the Itanium architecture. In 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), 2002.

[6] Intel Corporation. Intel IA-64 Architecture Software Developer's Manual. Santa Clara, CA, 2002.

[7] S. Jarp. A methodology for using the Itanium 2 performance counters for bottleneck analysis. Technical report, Hewlett-Packard Labs, August 2002. http://www.hpl.hp.com/research/linux/perfmon/.

[8] D. Keppel. Register windows and user-space threads on the SPARC. Technical Report TR-91-08-01, 1991.

[9] T. Kiyohara, S. Mahlke, W. Chen, R. Bringmann, R. Hank, S. Anik, and W. Hwu. Register connection: A new approach to adding registers into instruction set architectures. In Proceedings of the 20th International Symposium on Computer Architecture, pages 247–256, May 1993.

[10] M. Postiff, D. Greene, S. Raasch, and T. N. Mudge. Integrating superscalar processor components to implement register caching. In International Conference on Supercomputing, pages 348–357, 2001.

[11] A. Settle, D. A. Connors, G. Hoflehner, and D. Lavery. Optimization for the Intel Itanium architecture register stack. In Proceedings of the International Symposium on Code Generation and Optimization, pages 115–124. IEEE Computer Society, 2003.

[12] D. L. Weaver and T. Germond. The SPARC Architecture Manual. SPARC International, Inc., Menlo Park, CA, 1994.

[13] R. D. Weldon, S. S. Chang, H. Wang, G. Hoflehner, P. H. Wang, D. Lavery, and J. P. Shen. Quantitative evaluation of the register stack engine and optimizations for future Itanium processors. In Proceedings of the Sixth Annual Workshop on Interaction between Compilers and Computer Architectures, Santa Clara, CA, July 2002.

[14] L. Yang, S. Chan, G. R. Gao, R. Ju, G.-Y. Lueh, and Z. Zhang. Inter-procedural stacked register allocation for Itanium-like architecture. In Proceedings of the 17th Annual International Conference on Supercomputing, pages 215–225. ACM Press, 2003.