A Selective Caching Technique

L. John and R. Radhakrishnan

Electrical and Computer Engineering Department, The University of Texas at Austin, Austin, TX 78712
[email protected] [email protected]

Abstract

Efficient caches are extremely important for achieving good performance from modern high performance processors. Conventional cache architectures exploit locality, but do so rather blindly. Since all references are forced through the cache, every miss will result in a new block of information entering the cache. This reduces the effectiveness of the cache because infrequent items can displace frequent items from the cache. This paper presents a simple selective caching scheme for improving cache performance. The scheme uses information on usage counts to selectively exclude portions of the program or data from entering the cache. The cache should have a mechanism for excluding certain elements designated by the compiler. In this paper, we present the proposed selective caching scheme that permits only the high usage sections of the code/data to be cached, and evaluate the performance of the scheme. Cache bypassing improves performance by reducing memory bandwidth requirements because the bypassed items do not result in the fetching of the entire cache block to the processor. Extensive simulation studies on caches of size 1k to 64k bytes demonstrate that for a variety of SPEC programs, the proposed selective caching scheme can improve system performance.

Keywords: Cache conscious programming, cache optimization, cache bypassing, static cache exclusion policy, program behavior, program profiling.

* This work was supported in part by the National Science Foundation Grant CCR-0624378 and a grant from Oak Ridge Associated Universities.

1 Introduction

Caching is a time-tested mechanism to alleviate memory latency [19], and is widely used to bridge the processor-memory speed disparity in instruction and data fetching. Traditionally, caches are transparent to the programmer, but recently the trend has been toward `cache conscious programming', where programs are optimized taking cache parameters into account [29]. This paper focuses on improving the performance of caches using compile-time optimizations.

The average access time for a memory reference in a system with a cache depends on the hit rate (or the miss rate) and the miss penalty. Hence improving cache performance involves minimizing the miss rate and the miss penalty. The miss rate can often be reduced by increasing the cache size, the cache block size, and/or the associativity of the cache. It is difficult to increase the cache size or associativity very much while matching the cache hit access time to the clock speeds of modern fast processors. Increasing the block size often reduces the miss rate, but it also increases the miss penalty.

In this paper we present a selective caching scheme that employs bypassing, or cache exclusion, for infrequent items. In conventional caches, every reference is forced through the cache and every miss results in a new block of information entering the cache. This new block of information could be a piece of rarely used data, but it may replace a piece of heavily used data and result in additional misses. Thus high usage instructions may be knocked off by infrequent instructions, resulting in increased miss rates and lower cache performance. It is our hypothesis that cache performance can be improved by prohibiting infrequent items from entering the cache. The excluded items are directly fetched to the processor. The items that are bypassed do not result in transferring an entire block from the main memory (or secondary cache) to the higher level cache, and hence reduce the memory bandwidth requirement and bus traffic. Memory bandwidth is at a premium in most modern processors, and hence the proposed approach has great potential for improving system performance.

Cache bypassing is not a new idea. It has been studied in the past by McFarling [32], Chi and Dietz [10], Abraham et al. [1], Gonzalez et al. [14], Tyson et al. [44], and others. McFarling's scheme [32] is for instruction caches, whereas the proposals in [10], [1], [14] and [44] are for data caches. McFarling [32] proposed a mechanism to dynamically decide whether an instruction causes conflicts and should be excluded from the cache. The decision whether

an instruction should be replaced when a competing instruction is needed, or whether the competing instruction should just be passed on to the processor bypassing the cache, is made by a special finite state machine (FSM) in conjunction with two or more state bits associated with each cache block. Chi and Dietz [10] showed that bypassing the cache can avoid cache pollution and improve performance for data loads and stores. Abraham et al. [1] and Tyson et al. [44] presented schemes where they profile the cache performance of a program, identify the instructions which result in the most data misses, and exclude them from the cache. In all these cases, results have been good. However, it is not known what kind of performance can be obtained by a simple cache exclusion scheme based on usage frequencies. Investigating this is what we undertake in this paper.

While McFarling's dynamic exclusion policy [32] yields good performance for instruction caches, there is the complexity associated with the FSM and the additional state bits. Tyson et al.'s scheme yields good performance, but it necessitates accurate profiling of the cache performance of the program for the particular cache under consideration. The scheme we evaluate in this paper is a simple cache bypass scheme based on the frequency of usage of basic blocks within a program. It does not require complex hardware as in [32] or detailed cache profiling as in [44]. The proposed scheme requires only a profile file that contains basic block execution counts, as in McFarling's scheme [31]. Generation of such a profile file with block counts is simple, as demonstrated by past research [31] [42]. Simple block counts also permit using an average if the behavior of the program varies widely depending on the input. To our knowledge, no past study quantifies the performance improvement possible by simply excluding the low-usage references from the cache.

The scheme requires the cache to have a mechanism to exclude certain instructions designated by the compiler. Although such features are not very common, many commercial processors including the Convex C-1 [46] and Intel i860 [21] provide non-caching load instructions or the ability to explicitly allocate or deallocate individual cache lines. The DEC Alpha [6] provides a means for specifying some portions of the memory as non-cacheable (though this ability is not available at user level). Thus it is clear that incorporation of a cache bypassing scheme for instructions is well within the capability of modern processors. Although some processors have such capabilities, sufficient studies have not been done to quantify the performance improvement obtained by allowing the low frequency usage code to bypass the cache.

In Section 2, the proposed selective caching scheme is described. In Section 3, a quantitative analysis of the performance of the selective caching scheme based on trace-driven simulation is presented. Section 4 discusses other related research, and Section 5 presents a summary and concluding remarks.

2 The Selective Caching Scheme

The proposed scheme can be applied to instruction and data caches. For instruction caches, profiling is done on programs in order to identify the heavily used basic blocks. The basic block usage frequencies are analyzed and classified into High Usage (HU), Medium Usage (MU) and Low Usage (LU) categories as illustrated in Fig. 1. When applying the scheme to data caches, the usage frequencies of the various data structures referenced by the program are analyzed and the heavily referenced data structures are identified. In practice, it will be difficult to find clear demarcations between the HU, MU, and LU categories. (However, this is true of the code layout schemes in Torrellas et al. [42] as well. They took a size of 1k for their self-conflict-free area, which is the equivalent of our heavy-use (HU) area.) The threshold for setting each demarcation does not need to be fixed; it could be variable depending on the cache size and program size. It may be noted that if the code/data set is extremely small compared to the size of the cache, the entire program may be marked as HU. If there is only one level of cache, only HU and LU categories are needed. Based on the profile input, the compiler marks HU instructions as cacheable to level 1, MU instructions as cacheable to level 2, LU instructions as cacheable to level 3, and so on. The essence of the scheme is to allow only the heavy usage sections of the code to be cached and to bypass rarely used code/data directly to the instruction buffer/instruction register or data registers in the processor.
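To make the classification step concrete, the following sketch sorts profiled basic blocks by execution count and assigns HU/MU/LU tags against size budgets. The budgets (taken here to equal the level-1 and level-2 cache sizes), the record layout, and the function names are our own illustrative assumptions, not the actual profiling tool.

/* Illustrative sketch (not the authors' profiler): classify basic blocks
 * into HU/MU/LU by descending execution count, against size budgets that
 * are assumed here to equal the level-1 and level-2 cache sizes. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int  id;        /* basic block identifier      */
    long size;      /* block size in bytes         */
    long count;     /* profiled execution count    */
    int  level;     /* 0 = HU, 1 = MU, 2 = LU      */
} Block;

static int by_count_desc(const void *a, const void *b) {
    long ca = ((const Block *)a)->count, cb = ((const Block *)b)->count;
    return (cb > ca) - (cb < ca);            /* sort by descending count */
}

static void classify(Block *blk, int n, long l1_budget, long l2_budget) {
    long used = 0;
    qsort(blk, n, sizeof(Block), by_count_desc);
    for (int i = 0; i < n; i++) {
        used += blk[i].size;
        if (used <= l1_budget)       blk[i].level = 0;  /* HU: cacheable to level 1 */
        else if (used <= l2_budget)  blk[i].level = 1;  /* MU: cacheable to level 2 */
        else                         blk[i].level = 2;  /* LU: bypasses both caches */
    }
}

int main(void) {
    Block blk[] = { {1, 64, 900000, 0}, {2, 128, 50000, 0}, {3, 256, 12, 0} };
    classify(blk, 3, 128, 256);              /* tiny budgets, for illustration only */
    for (int i = 0; i < 3; i++)
        printf("block %d -> %s\n", blk[i].id,
               blk[i].level == 0 ? "HU" : blk[i].level == 1 ? "MU" : "LU");
    return 0;
}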

Figure 1: Classification of code into categories based on frequency of usage. HU - Heavy Usage, MU - Medium Usage, LU - Low Usage.

Fig. 2 illustrates a system which can implement the proposed selective caching scheme.

Figure 2: System with bypassing cache (processor with instruction buffer or instruction register, on-chip cache, off-chip cache, main memory, and disk).

Our selective caching scheme allows only HU elements to enter the top layer of cache and only HU and MU elements to enter the second layer, i.e., MU elements will bypass the on-chip cache and LU elements will bypass both the on-chip and off-chip caches. The elements to be excluded from the cache can be indicated by several methods, one being the use of a cache-on/cache-off instruction pair, turning off the cache right before entering the excluded sections. The problem with cache-on/cache-off instructions is that each instruction requires a cycle to issue. Another option is to use bits to indicate the highest level in the memory hierarchy to which a particular instruction, or the data for a load instruction, can be promoted. This would necessitate spending one or two bits in the instruction format for this purpose. Yet another option is to reorganize the code/data set in descending order of usage and tag every virtual memory page with bits indicating the highest level to which this page may be promoted. In this case, the HU and MU categories are in chunks of multiples of the virtual memory page size. Fig. 3 is a simplified diagram of the associated virtual to physical memory mapping and the page table.
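As a rough illustration of the page-tagging option, the fragment below shows how a miss might be serviced according to a per-page `highest cacheable level' tag of the kind shown in Fig. 3 (00: cacheable to all levels, 01: cacheable only up to the second-level cache, 10: not cached). The encoding follows Fig. 3, but the function names and stubbed cache interfaces are purely illustrative assumptions, not a specific hardware design.

/* Illustrative sketch of servicing a miss under the per-page "highest
 * cacheable level" tag of Fig. 3.  The fill/forward routines are stubs
 * standing in for hypothetical cache and bus interfaces. */
#include <stdio.h>

enum { CACHE_ALL = 0, CACHE_L2_ONLY = 1, CACHE_NONE = 2 };   /* 00, 01, 10 in Fig. 3 */

static void fill_block_l1(unsigned long addr)      { printf("fill L1 block at %#lx\n", addr); }
static void fill_block_l2(unsigned long addr)      { printf("fill L2 block at %#lx\n", addr); }
static void forward_word_to_cpu(unsigned long addr){ printf("forward word at %#lx to CPU\n", addr); }

static void service_miss(unsigned long addr, int page_tag) {
    switch (page_tag) {
    case CACHE_ALL:            /* HU page: the block enters both cache levels      */
        fill_block_l2(addr);
        fill_block_l1(addr);
        break;
    case CACHE_L2_ONLY:        /* MU page: fill the off-chip cache, bypass on-chip */
        fill_block_l2(addr);
        forward_word_to_cpu(addr);
        break;
    default:                   /* LU page: no block transfer at all                */
        forward_word_to_cpu(addr);
        break;
    }
}

int main(void) {
    service_miss(0x1000, CACHE_ALL);
    service_miss(0x2000, CACHE_L2_ONLY);
    service_miss(0x3000, CACHE_NONE);
    return 0;
}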

Among other bookkeeping information, a few additional bits (two per page here) are added to indicate the highest level to which the data on this page should be promoted. For example, for the bit assignments indicated in Fig. 3, if the bits are 00 for a particular page, all references in this page are cached at all levels of the cache; if the bits are 01, the references in the particular page are cached only up to the second level of cache; and if the bits are 10, they are not cached in either of the caches. This tagging is performed at compilation, taking the cache sizes into consideration. Instead of profiling the program to obtain usage frequencies, it is possible to obtain similar results through programmer-specified directives, because programmers often know which segments of the code (or which data structures) are used most frequently. Programmers can be allowed to specify their choice through directives. It is assumed that the relevant parts of the page table will be in the TLB, so that no extra `memory' accesses are involved in accessing the `Highest Cacheable Bits'.

Figure 3: Cache exclusion information is stored along with the valid bits in the page table. The bits indicate the highest cacheable level: 00 - first level cache, 01 - second level cache, 10 - main memory.

The proposed scheme emphasizes temporal locality; however, it does not ignore spatial locality. Use of large block sizes can help to exploit spatial locality in the frequently used sections of the code. Non-caching of rarely used items means that spatial locality in them cannot be exploited, but if 10% of the static code/data is cached, that might lead to almost 90% of the dynamic references being cached, and the spatial locality in this 90% will be captured by the cache. Caching of LU items is certainly beneficial in capturing spatial locality in the other 10%, but similar improvement can be obtained if the processor supports non-blocking instruction fetch, especially in conjunction with a wide instruction buffer.

Let us examine how effective a scheme based on static instruction frequencies will be for some of the common reference patterns. Reproducing an example that McFarling [32] used to illustrate his dynamic cache exclusion scheme, consider a case where two elements A and B within a single loop compete for space in the cache. If the loop is executed 10 times, the memory access pattern may be represented as (AB)^10, where the superscript denotes the frequency of usage of the particular instruction. If we allow both elements to enter the cache, the instructions will knock each other out of the cache and neither hits. Hence the behavior of a conventional cache is (Am Bm)^10, where a subscript of m denotes a miss and a subscript h indicates a hit. Hence the miss rate of a conventional cache is

Mconv = 100%.

Instead of allowing every element to enter the cache, let us keep one element, say A, in the cache and exclude the other one. The behavior of the bypassing cache is

Am Bm (Ah Bm)^9

and the miss rate of the bypassing cache is

Mbyp = 55%.
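The arithmetic behind these two figures can be replayed mechanically. The short program below is a sketch of our own (it is not the simulator used later in the paper): it walks a reference string over a single conflicting cache block, once with conventional allocate-on-miss behavior and once admitting only the chosen element, and reproduces the 100% and 55% miss ratios for (AB)^10.

/* Sketch that replays a reference string over a single conflicting cache
 * block and reports miss ratios for a conventional cache versus a bypassing
 * cache that admits only one chosen element (here 'A'). */
#include <stdio.h>
#include <string.h>

static double miss_ratio(const char *refs, int bypass, char cached_elem) {
    char resident = 0;               /* element currently in the block, 0 = empty */
    int misses = 0, n = (int)strlen(refs);
    for (int i = 0; i < n; i++) {
        char r = refs[i];
        if (r == resident) continue;             /* hit */
        misses++;                                /* miss */
        if (!bypass || r == cached_elem)
            resident = r;                        /* allocate on miss, or only if chosen */
    }
    return 100.0 * misses / n;
}

int main(void) {
    char refs[64] = "";
    for (int i = 0; i < 10; i++) strcat(refs, "AB");        /* the (AB)^10 pattern */
    printf("Mconv = %.0f%%\n", miss_ratio(refs, 0, 'A'));   /* 100% */
    printf("Mbyp  = %.0f%%\n", miss_ratio(refs, 1, 'A'));   /* 55%  */
    return 0;
}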

Table 1 compares the behavior of a conventional cache and a bypassing cache for several other reference patterns; Table 2 gives the corresponding miss ratios.

     Reference Pattern              Conventional cache behavior                   Bypassing cache behavior
 1   (A^10 B)^10                    (Am Ah^9 Bm)^10                               Am Ah^9 Bm (Ah^10 Bm)^9
 2   (A^10 B^10)^10                 (Am Ah^9 Bm Bh^9)^10                          Am Ah^9 Bm^10 (Ah^10 Bm^10)^9
 3   (A B B)^10                     (Am Bm Bh)^10                                 Am Bm Bh (Am Bh Bh)^9
 4   (B A B)^10                     Bm Am Bm (Bh Am Bm)^9                         Bm Am Bh (Bh Am Bh)^9
 5   (A A A B)^10                   (Am Ah^2 Bm)^10                               Am Ah^2 Bm (Ah^3 Bm)^9
 6   (A B C)^10                     (Am Bm Cm)^10                                 Am Bm Cm (Ah Bm Cm)^9
 7   (A B B^10)^10                  (Am Bm Bh^10)^10                              Am Bm Bh^10 (Am Bh^11)^9
 8   (B A B^10)^10                  Bm Am Bm Bh^9 (Bh Am Bm Bh^9)^9               Bm Am Bh^10 (Bh Am Bh^10)^9
 9   (A A A B^10)^10                (Am Ah^2 Bm Bh^9)^10                          Am^3 Bm Bh^9 (Am^3 Bh^10)^9
10   (A B C^10)^10                  (Am Bm Cm Ch^9)^10                            Am Bm Cm Ch^9 (Am Bm Ch^10)^9
11   (A^10 B^10 C^10 D^10 E^10)^10  (Am Ah^9 Bm Bh^9 Cm Ch^9 Dm Dh^9 Em Eh^9)^10  Am Ah^9 Bm^10 Cm^10 Dm^10 Em^10 (Ah^10 Bm^10 Cm^10 Dm^10 Em^10)^9
12   (A B A C A D A E A F)^10       (Am Bm Am Cm Am Dm Am Em Am Fm)^10            Am Bm Ah Cm Ah Dm Ah Em Ah Fm (Ah Bm Ah Cm Ah Dm Ah Em Ah Fm)^9

Table 1: Caching behaviors of a conventional cache and a bypassing cache for a few reference patterns. A, B, C, D, and E are assumed to conflict with each other. The element with the highest frequency is chosen to be cached; if several elements have the same high frequency, an arbitrary element is chosen. In the 12 sequences presented in the table, the cached element is A, A, B, B, A, A, B, B, B, C, A and A, respectively.

     Reference Pattern              Mconv (%)   Mbyp (%)
 1   (A^10 B)^10                       18          10
 2   (A^10 B^10)^10                    10          50.5
 3   (A B B)^10                        67          37
 4   (B A B)^10                        70          37
 5   (A A A B)^10                      50          27.5
 6   (A B C)^10                       100          70
 7   (A B B^10)^10                     16.7         9.2
 8   (B A B^10)^10                     17.5         9.2
 9   (A A A B^10)^10                   15.4        23.8
10   (A B C^10)^10                     25          17.5
11   (A^10 B^10 C^10 D^10 E^10)^10     10          80
12   (A B A C A D A E A F)^10         100          49

Table 2: Miss ratios of the conventional cache and the bypassing cache for the reference patterns in Table 1.

From Tables 1 and 2, it is seen that except for patterns 2, 9 and 11, all patterns yield improved performance with the proposed static cache exclusion policy. Sometimes, caching the entire program may be more fruitful than bypassing the cache for part of the program. For instance, if there is a sequence such as A^10000 B^10000 A B, normal caching would incur only 4 misses, whereas if A or B is bypassed, approximately half the references will be misses. Two other examples are sequences 2 and 9 in Table 2. Bypassing will be efficient if there are drastic transitions displacing the prominent working set from the higher level in the memory hierarchy, e.g., sequence 12.

3 Simulation Studies

The performance of the proposed selective caching scheme for real programs will depend on the actual memory referencing patterns in the particular program. In this section, we present results from simulation studies that investigate the effectiveness of the proposed scheme for various programs taken from the SPEC suite. Trace-driven simulation, which has the advantage of fully and accurately reflecting real instruction streams, is used for the study. For our evaluation, we employed an architecture simulator which processes address traces generated by pixie [34] [39] on a DEC5000 workstation based on the MIPS R3000 processor.

3.1 Performance Metrics

Performance is evaluated using the total memory access time and a speedup defined as the ratio of the total memory access time with a conventional cache to the total access time with selective caching. If h is the number of accesses which hit in the cache, m is the number of

misses, t_h is the hit access time and t_m is the miss penalty, then

t_total = h * t_h + m * t_m.

We assume the cache access time t_h to be 1 cycle, and the miss penalty t_m to be 5 * (block size in bytes) / 4 cycles. To facilitate easy comparison, we compute a speedup figure as the ratio of the effective access times without and with the selective caching scheme:

Speedup = t_eff(with normal cache) / t_eff(with selective caching)
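A minimal numeric reading of these definitions is sketched below; the hit and miss counts are invented purely for illustration, but the block-size arithmetic reproduces the miss penalties used later in Section 3.3 (40 cycles for 32-byte blocks, 80 cycles for 64-byte blocks).

/* Sketch of the metrics above: t_total = h*t_h + m*t_m, with t_h = 1 cycle
 * and t_m = 5*(block size in bytes)/4 cycles, and the speedup as the ratio
 * of total access times.  The hit/miss counts are made-up inputs. */
#include <stdio.h>

static double total_time(long hits, long misses, int block_bytes) {
    double t_h = 1.0;                        /* cache hit time, 1 cycle        */
    double t_m = 5.0 * block_bytes / 4.0;    /* miss penalty in cycles         */
    return hits * t_h + misses * t_m;
}

int main(void) {
    long refs = 30000000;                    /* illustrative reference count   */
    double t_conv = total_time(refs - 900000, 900000, 32);   /* hypothetical miss counts */
    double t_sel  = total_time(refs - 700000, 700000, 32);
    printf("miss penalty (32-byte blocks) = %.0f cycles\n", 5.0 * 32 / 4);  /* 40 */
    printf("miss penalty (64-byte blocks) = %.0f cycles\n", 5.0 * 64 / 4);  /* 80 */
    printf("speedup = %.3f\n", t_conv / t_sel);
    return 0;
}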

3.2 Benchmarks and Trace generation

The benchmarks consisted of 10 programs from the SPEC92 suite: compress, xlisp, eqntott, espresso, doduc, ora, swm256, ear, alvinn, and tomcatv. The address traces were generated by a pixie-based tool derived from [39], on a DEC5000 workstation based on the MIPS R3000 processor. The SPEC92 programs were compiled with the default Makefiles, and pixified executables of the programs were generated and traced. Simulating the entire traces of the regular input files for all the different cases would require several months of simulation time. Hence we decided to perform the simulations for only 30 million references or fewer. If SPEC provided a short input with fewer than 30 million references, we used that and traced the program entirely. The smaller inputs were used because then we could trace the entire application and see effects due to operating system activity as well. If such a small input was not available, the first 30 million references were traced. However, we believe that using small inputs or partial runs will not affect the results on instruction caches, because even the smaller inputs exercised most of the static code, and in reference patterns very similar to the large runs. (We verified this by comparing the usage count statistics of runs with large and small inputs.) In the case of data caching, there are more differences between the data sets of the smaller and larger inputs; however, the larger input sets will only increase the conflicts in the cache, and as will be demonstrated in the forthcoming sections, our scheme works better when there are more conflicts. Hence we have an extremely high level of confidence in the results presented in this paper. Seven out of the 10 traces are entire programs. Tables 3 and 4 illustrate the locality characteristics of the data and instruction accesses required by the benchmark programs.

Benchmark     1k      4k      16k     64k
xlisp         49.65   72.33   88.21   99.98
ear           10.65   28.90   95.88   99.10
espresso      58.65   79.60   96.0    99.9
tomcatv       22.68   29.75   56.24   100
eqntott       13.93   23.51   43      77.88
ora           98.99   99.94   100     100
doduc         46.68   69.39   93.75   100
alvinn        95.02   95.34   96.48   99.28

Table 3: Locality characteristics of the data traces used. For each benchmark, the percentage of dynamic references arising from the most frequently used data set of size 1k, 4k, 16k and 64 kbytes is shown.

These tables justify the cache sizes chosen for the simulations (explained in the forthcoming section). Efforts were made to use all 20 programs in the suite; however, we encountered some problems with our tracing tool and were not able to generate the other traces.

3.3 Configurations simulated

We used cache sizes of 4k, 16k and 64k bytes for the data cache, and performed experiments for block sizes of 32 bytes and 64 bytes. For instruction caches, since the code footprint of most SPEC programs fits within caches of size greater than 4k (the reader may observe this in Table 4), we used 1k, 2k and 4k caches. For larger instruction caches, the speedups would be one because the threshold is often the full program size or a major part of it. A miss penalty of 40 cycles is assumed, considering the block size of 32 bytes, a bus width of 32 bits, and an access time of 5 processor cycles. For the cache with 64-byte blocks, the miss penalty is assumed to be 80 cycles.

3.4 The simulation process

The first step in the simulation process is the generation of a file containing the usage counts of instructions and data. The virtual memory page size is assumed to be 1k in our experiments. Based on the usage file, the most frequently used 256 instructions are assigned page number one (because the MIPS instruction size equals 4 bytes, a 1k page can contain 256 instructions), the next most frequently used 256 instructions are assigned page number two, and so on.

Benchmark     1k      4k      16k     64k
xlisp         57.72   99.99   100     100
ear           90.79   99.24   99.99   100
espresso      36.73   71.91   94.48   99.99
tomcatv       75.13   98.22   99.99   100
eqntott       98.32   99.77   100     100
ora           95.84   99.91   99.98   100
doduc         31.08   48.74   70.07   99.94
alvinn        57.72   99.99   100     100
compress      99.95   99.98   100     100
swm           87.41   99.92   99.99   100

Table 4: Locality characteristics of the instruction traces used. For each benchmark, the percentage of dynamic references arising from code of size 1k, 4k, 16k and 64 kbytes is shown.

The simulator is written in C and runs under Unix. The simulator obtains parameters such as cache size, block size, associativity, memory access time, etc. from a parameter file. In order to facilitate the simulation of selective caching, a page threshold value is added to the parameter file that contains the simulation parameters. The simulator compares the page threshold value for the current run with the page number corresponding to each reference. If the page number is below the threshold, the corresponding reference is cached; if it is above the threshold, the corresponding reference is excluded from the cache and fed directly to the processor. The simulator results for the normal cache were compared to results from dinero [45] for several cases and found to be identical.

3.5 Performance of data traces

This set of experiments consisted of marking the most frequently used 10% of the data set as HU (cacheable) and evaluating the cache performance. Data falling in the HU range are cached based on the conventional mapping and replacement policies. If 10% of the data set is smaller than the cache size, the cacheable section is made equal to the cache size. Fig. 4 illustrates the total memory access time for the data traces. The percentage speedups averaged over all the benchmark programs are presented in Table 5. If the speedup is S, the percentage speedup is calculated as (S - 1) * 100. There is performance improvement in the majority of the cases that we studied.
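Read literally, the cutoff rule above amounts to taking the larger of 10% of the data set and one cache's worth of the hottest data; a tiny sketch of that rule, with invented sizes, follows.

/* Sketch of the HU cutoff used in these experiments: the most frequently
 * used 10% of the data set is marked cacheable, but never less than one
 * cache's worth.  Sizes are in bytes; the figures below are only examples. */
#include <stdio.h>

static long hu_bytes(long dataset_bytes, long cache_bytes) {
    long tenth = dataset_bytes / 10;                 /* most frequently used 10% */
    return tenth < cache_bytes ? cache_bytes : tenth;
}

int main(void) {
    printf("HU region: %ld bytes\n", hu_bytes(2000000, 16384));  /* 200000: 10% rule        */
    printf("HU region: %ld bytes\n", hu_bytes(100000, 16384));   /* 16384: cache-size floor */
    return 0;
}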

Figure 4: Total memory access time (in cycles) of the selective caching scheme and normal caching for data traces, for cache sizes of 4 Kbytes, 16 Kbytes and 64 Kbytes. For each benchmark, block sizes of 32 bytes and 64 bytes are considered.

Cache Size    %Speedup (32 byte blocks)    %Speedup (64 byte blocks)
4k            22                           41
16k           12                           16
64k           12                           13

Table 5: Percentage speedup for the data cache, averaged over all the benchmarks.

In general, the impact of bypassing is higher at lower cache sizes because there are more misses and there is more room for improvement. There is no performance improvement from selective caching once the cache is big enough to contain at least half of the program, because there is no cache pressure. Whenever there is reasonable cache pressure, selective caching alleviates the harmful effects of conflict and capacity misses and helps to obtain higher speedups. Cache block size also has an impact on the performance of the proposed scheme. In general, the speedup is higher for larger block sizes. In normal caches, increasing the block size allows more spatial locality to be exploited, but may decrease performance due to higher bus traffic, higher miss penalty and the loss of more useful data in `spurious' replacements. (By a `spurious' replacement, we mean the replacement of an item to be used in the future by an infrequent item.) In general, the speedup becomes higher as the block size increases. One may view cache bypassing as a means to curb the undesirable effects of a larger block size.

3.6 Performance of instruction traces

This set of experiments consisted of marking the most frequently used 10% of the code as HU (cacheable) and evaluating the cache performance. Instructions falling in the HU range are cached based on the conventional mapping and replacement policies. If 10% of the code is smaller than the cache size, the cacheable section of code is made equal to the cache size. Fig. 5 illustrates the total memory access time for the instruction traces. The percentage speedups over all the benchmark programs are presented in Table 6. The impact of cache size and block size on the performance of selective caching is seen to be similar in both instruction and data caching. Another observation is that the speedups are more significant for data caches in comparison to instruction caches. The results obtained here are from a very pessimistic evaluation: it did not include exploitation of spatial locality using a larger instruction buffer, nor does it reflect any improvement from reorganization of code.

Figure 5: Total memory access time (in cycles) of the selective caching scheme and conventional caching for instruction traces, for cache sizes of 1 Kbyte, 2 Kbytes and 4 Kbytes. The cache sizes considered are small because most of the programs can be contained entirely in 8 kbyte or larger caches.

Cache Size    %Speedup (32 byte blocks)    %Speedup (64 byte blocks)
1k            7.4                          11.4
2k            4.1                          7.3
4k            0                            1.5

Table 6: Percentage speedup for instruction caches, averaged over all the benchmarks.

Previous research has shown that code rearrangement can significantly improve cache performance (see the forthcoming section). If our scheme were implemented over code organized in descending order of usage, the improvement would have been much higher than what we obtained here.

4 Other Related Research

Optimizations based on compile-time information have been observed to be very effective in the past [9] [31] [30] [41] [42] [47]. If the compiler can find instructions (or groups of instructions) that need to be in the cache at the same time, performance can be improved by placing the instructions so that they do not map into the same cache block. Chang et al. [9] developed a technique based on identifying groups of basic blocks within a routine that tend to execute in sequence. These basic blocks are then placed in contiguous cache locations. Frequent callee routines are placed immediately after their callers. Through these placement techniques, the spatial locality of the code increases, improving the performance of the cache. McFarling [31] looked at algorithms to order basic blocks in such a way as to increase the instruction cache hit ratio. McFarling's technique uses a profile of the conditional, loop and routine structure of the program. McFarling places the basic blocks in such a way that callers of routines, loops, and conditionals do not interfere with the callee routines or their descendents. Mendlson et al. [30] employ code replication based on static information to eliminate conflicts. Temam and Drach [41] used simple software information on the temporal/spatial locality of array references, as provided by current data locality optimizing algorithms, to significantly increase cache performance. Torrellas et al. [42] presented a scheme to optimize instruction cache performance by optimizing the layout

of the code in memory. They sorted code based on usage frequency and placed it in cache-sized chunks in such a way that the most frequent sequences face fewer or no conflicts from less frequent code. Westerholz et al. [47] presented schemes to improve the performance of memory accessing through cache-driven memory management. They observed that cache conscious page placement reduces the secondary-cache miss ratio by as much as 80%.

5 Summary and Conclusion

In this paper, we presented a cache bypassing technique based on static instruction usage frequencies obtained from program profiling. Simulation studies demonstrate that simple exclusion of rarely executed code/data from the cache does result in an improvement of cache performance. For data caches, the average speedup ranges from 12% to 41% over caches of sizes ranging from 4k to 64k bytes. For instruction caches, the average speedup ranges from 0% to 12% over caches of size 1k to 4k bytes. The scheme performs better at smaller cache sizes because there is more cache pressure and there is more room for improvement. The scheme improves performance by reducing memory bandwidth requirements because the bypassed items do not result in the fetching of the entire cache block to the processor. Hence the performance is in general better at larger cache block sizes. The performance of the scheme, although modest, shows that bypassing based on usage counts is a very simple and promising technique for improving cache performance. The major disadvantage associated with the technique is that some program profiling may be required. However, optimizations based on profiling have been very successful and popular in the recent past, and the profiling information required for the proposed scheme is not as detailed as what is required by many other recently proposed profiling-based techniques.


References

[1] S. G. Abraham, R. S. Sugumar, B. R. Rau, and R. Gupta, "Predictability of load/store instruction latencies", Proceedings of MICRO-26, 1993, pp. 139-152.
[2] A. Agarwal, J. Hennessy, and M. Horowitz, "Cache Performance of Operating Systems and Multiprogramming", ACM Transactions on Computer Systems, 6(4):393-431, November 1988.
[3] A. Agarwal and S. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches", International Symposium on Computer Architecture, 1993, pp. 179-190.
[4] D. B. Alpert and M. J. Flynn, "Performance Trade-offs for Microprocessor Cache Memories", IEEE Micro, August 1988, pp. 44-53.
[5] T. Alanko, I. Haikala, and P. Kutvonen, "Methodology and Empirical Results of Program Behavior Measurements", Proceedings of the 7th IFIP W.G.7.3 International Symposium on Computer Performance, Modeling, Measurement and Evaluation, Toronto, May 1990, pp. 55-66.
[6] Alpha Architecture Handbook, Digital Equipment Corporation, 1992.
[7] B. Bershad, D. Lee, T. Romer, and B. Chen, "Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches", Proceedings of ASPLOS-VI, pp. 158-170.
[8] D. Callahan, K. Kennedy, and A. Porterfield, "Software Prefetching", Proceedings of ASPLOS-IV, pp. 40-52.
[9] W. Y. Chen, P. P. Chang, T. M. Conte, and W. W. Hwu, "The Effect of Code Expanding Optimizations on Instruction Cache Design", IEEE Transactions on Computers, 42(9):1045-1057, September 1993.
[10] C-H. Chi and H. Dietz, "Unified Management of Registers and Cache Using Liveness and Cache Bypass", Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, Vol. 24(7), pp. 344-355, June 1989.
[11] P. J. Denning, "Working Sets Past and Present", IEEE Transactions on Software Engineering, Jan 1980, pp. 64-84.
[12] D. R. Ditzel, "Program Measurements on a High Level Language Computer", Computer, Vol. 13, No. 8, August 1980, pp. 62-72.
[13] D. Gannon, W. Jalby, and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformation", Journal of Parallel and Distributed Computing, No. 5, pp. 587-616, 1988.
[14] A. Gonzalez, C. Aliagas, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality", Proceedings of the ACM International Conference on Supercomputing, Barcelona, Spain, 1995, pp. 338-347.
[15] E. Gornish, E. Granston, and A. Veidenbaum, "Compiler Directed Data Prefetching in Multiprocessors with Memory Hierarchies", 1990 International Conference on Supercomputing, pp. 354-368.
[16] K. Gosmann, C. Hafer, H. Lindmeier, J. Plankl, and K. Westerholz, "Code Reorganization for Instruction Caches", Proceedings of the 26th Annual Hawaii International Conference on System Sciences, Vol. I, 1993, pp. 214-223.
[17] R. B. Hagman and R. S. Fabry, "Program Page Reference Patterns", Proceedings of the 1982 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, Seattle, WA, 1982, pp. 20-29.
[18] D. J. Hatfield, "Experiments on Page Size, Program Access Patterns and Virtual Memory Performance", IBM Journal of R & D, Vol. 16, No. 1, Jan 1972, pp. 58-66.
[19] M. D. Hill and A. J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories", Proceedings of the 12th International Symposium on Computer Architecture, June 1985, pp. 55-63.
[20] M. D. Hill and A. J. Smith, "Evaluating Associativity in CPU Caches", IEEE Transactions on Computers, Vol. 38, No. 12, Dec 1989, pp. 1612-1630.
[21] i860 XP Microprocessor Data Book, Intel Corp., 1991.
[22] N. P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Buffers", Proceedings of the 17th International Symposium on Computer Architecture, 1990, pp. 364-373.
[23] B. W. Kernighan, "Optimal Sequential Partitions of Graphs", JACM, Vol. 18, No. 1, 1971, pp. 34-40.
[24] D. R. Kerns and S. J. Eggers, "Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain", Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pp. 278-289.
[25] R. E. Kessler, M. D. Hill, and D. A. Wood, "A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches", IEEE Transactions on Computers, Vol. 43, No. 6, June 1994, pp. 664-675.
[26] A. C. Klaiber and H. M. Levy, "Architecture for Software Controlled Prefetching", Proceedings of the 18th International Symposium on Computer Architecture, pp. 43-63, 1991.
[27] D. Kroft, "Lockup-free Instruction Fetch/Prefetch Cache Organization", Proceedings of the 8th Annual International Symposium on Computer Architecture, pp. 81-87, June 1981.
[28] M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms", Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 63-74.
[29] A. R. Lebeck and D. A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study", IEEE Computer, October 1994, pp. 15-26.
[30] A. Mendlson, S. Pinter, and R. Shtokhamer, "Compile Time Instruction Cache Optimizations", Computer Architecture News, pp. 44-51, March 1994.
[31] S. McFarling, "Program Optimization for Instruction Caches", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, 1989, pp. 183-191.
[32] S. McFarling, "Cache Replacement with Dynamic Exclusion", Proceedings of the International Symposium on Computer Architecture, May 1992, pp. 191-200.
[33] K. S. McKinley, "Automatic and Interactive Parallelization", Ph.D. thesis, Rice University, Technical Report CRPC-TR92214, April 1992.
[34] UMIPS-V Reference Manual, MIPS Computer Systems, Sunnyvale, California, 1990.
[35] User's Guide to spix, spixstats and shade, Sun Microsystems.
[36] S. Palacharla and R. E. Kessler, "Evaluating Stream Buffers as a Secondary Cache Replacement", Proceedings of the 21st International Symposium on Computer Architecture, April 1994, pp. 24-33.
[37] K. Pettis and R. C. Hansen, "Profile Guided Code Positioning", Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pp. 16-27, 1990.
[38] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance Tradeoffs in Cache Design", Proceedings of the 15th Annual Symposium on Computer Architecture, pp. 290-298, IEEE Computer Society Press, June 1988.
[39] M. D. Smith, "Tracing with pixie", Technical Report No. CSL-TR-91-497, Computer Systems Laboratory, Stanford University, Stanford, California.
[40] J. E. Smith and W-C. Hsu, "Prefetching in Supercomputer Instruction Caches", Proceedings of Supercomputing '92, pp. 588-597, 1992.
[41] O. Temam and N. Drach, "Software Assistance for Data Caches", Proceedings of the High Performance Computer Architecture Symposium, Jan 1995, pp. 154-163.
[42] J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads", Proceedings of the High Performance Computer Architecture Symposium, Jan 1995, pp. 360-369.
[43] D. Tuite, "Cache Architectures Under Pressure to Match CPU Performance", Computer Design, March 1993, pp. 91-97.
[44] G. Tyson, M. Farrens, J. Matthews, and A. Pleszkun, "A Modified Approach to Data Cache Management", Proceedings of MICRO-28, 1995, pp. 93-103.
[45] The WARTS Tool Suite, University of Wisconsin, Madison.
[46] S. Wallach, "The CONVEX C-1 64-bit Supercomputer", Compcon Spring '85, February 1985.
[47] K. Westerholz, S. Honal, J. Plankl, and C. Hafer, "Improving Performance by Cache Driven Memory Management", Proceedings of the International High Performance Computer Architecture Symposium, Jan 1995, pp. 234-242.
