A Software Strategy to Improve Cache Performance*

S. Bartolini and C.A. Prete
Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Pisa, Italy
{s.bartolini, prete}@iet.unipi.it

* The present work has been carried out in the framework of the Esprit Project SPP, "Scalable Peripheral Processor," contract no. 29173. The project consortium parties are: C-Map, Marina di Carrara, Italy; Alcatel Microelectronics, Zaventem, Belgium; Cetrek, Poole, UK; Centro TEAM, Pisa, Italy, and Dipartimento di Ingegneria dell'Informazione, University of Pisa, Italy.

Abstract

In embedded systems, cost, power consumption, and die size requirements push the designer to use small and simple cache memories. Such caches can provide low performance because of their limited capacity and inflexible placement policy. A way to increase performance is to adapt the program layout to the cache structure. Finding the optimal layout is an NP-complete problem and requires a very long processing time; we propose a strategy that finds a near optimum program layout within a reasonable time by means of smart heuristics. The solution adds no code and uses standard linker functionality to produce the new layout. Our approach reduces misses by up to 70% in the case of a 2-Kbyte direct access cache.

1 Introduction

In embedded system design, caches [4], [5] are crucial for both performance and system cost: on one hand, a cache allows the system to employ slow off-chip memory and buses; on the other hand, it consumes resources (e.g., die size, power) and increases overall circuit complexity. Small caches typically show high miss rates, but even bigger caches can be less effective than expected. Caches can be poorly exploited because of the mismatch between the sequence of program memory accesses and the cache structure. This happens when the program working set is mapped onto a small number of cache sets: many useful memory block copies are mapped onto a few cache blocks and are therefore replaced frequently, while other cache sets are rarely accessed. This scenario causes conflict misses. Balancing the cache block load can reduce conflict misses, resulting in higher cache exploitation. Both hardware and software approaches have been proposed for improving cache performance: hardware techniques [6-12] introduce special cache architectures to increase cache flexibility; software techniques [13-23] modify the program structure to achieve better performance on the given cache. In embedded systems, software techniques are more suitable because they do not introduce extra cost and complexity. We propose a software technique which maximizes the performance of small and simple caches via program layout restructuring.
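For concreteness, consider how a direct access cache chooses a set: the set index is simply the block address modulo the number of sets, so the layout alone decides which areas collide. The following minimal sketch (Python; cache parameters and addresses are purely illustrative, not taken from the paper) shows two procedures placed exactly one cache size apart, which therefore evict each other whenever they alternate:

    BLOCK, SETS = 32, 64            # a hypothetical 2-Kbyte direct access cache

    def cache_set(addr):
        # Direct access mapping: block address modulo the number of sets.
        return (addr // BLOCK) % SETS

    # Two procedures placed 2 Kbytes apart map onto the same sets and
    # replace each other whenever they alternate: conflict misses.
    proc_a, proc_b = 0x8000, 0x8800
    assert cache_set(proc_a) == cache_set(proc_b)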

The proposed solution works on object files and outputs the placement indications for functions and data sections that the linker needs to build the executable image. In this way, the development tools need no modification, and we can also optimize libraries without knowing their source code. In addition, since the strategy does not act at the source or assembly level, it can easily be ported to other systems and is completely transparent to the programmer. We use an innovative approach that takes the program behavior into account very precisely in each placement step; this choice proves crucial for an effective reduction of conflict misses. Optimized programs running on systems with small caches show better results than the original versions on caches 2 or 4 times bigger. Moreover, the optimization can allow direct access caches to perform very close to associative ones.

2 Cache efficiency improvement

The ideal cache holds exactly the blocks that the application will need again in the future (Belady's optimal cache [3]). The performance gap between such a cache and a real one originates from the mismatch between the sequence of memory accesses generated by the program and the cache features (i.e., mapping and replacement policies). A program generally uses the cache in an unbalanced way: within a time interval, several used memory blocks are mapped onto the same cache sets. This situation causes conflict misses and poor efficiency in cache exploitation. Let us define the cache efficiency for a specific application: a cache with S blocks has an efficiency e (0 < e ≤ 100%) if a Belady's cache with (e/100) · S blocks produces a miss rate very close to that of the given cache. In this way, e represents the fraction of the cache capacity which is efficiently used by the program. Figure 1 shows the efficiency values for an mp3 encoder (mp3_enc) application on a direct access I-cache: the original layout exploits, on average, only 25% of the cache capacity.

The basic idea of the technique is to increase cache efficiency, and therefore reduce conflict misses, through the modification of the program layout. Figure 1 highlights the high efficiency values for the same application with the new layout. Table 1 summarizes the efficiency for a set of applications: the original efficiency values span from 15% (4 Kbytes, mp3_enc) to 59% (1 Kbyte, cjpeg); programs with the new layout show efficiency values ranging from 37% (8 Kbytes, mancala) to 91% (8 Kbytes, cjpeg; 4 Kbytes, mp3_enc). Mancala shows a relatively low optimized efficiency on the 8-Kbyte cache because, on both the 4-Kbyte and the 8-Kbyte cache, all misses in the optimized layout are cold misses; the extra capacity of the 8-Kbyte cache is therefore wasted. This efficiency improvement results in a miss rate reduction ranging from 20% to more than 90%, with an average of 73%.

We use a software technique that changes the relative position of code portions and data sections (areas). The proposed re-mapping technique avoids dealing with the assembly code of the target machine, resulting in a portable approach which can easily be included in existing development tools. The problem is to find placements of the application's procedures that produce a balanced load on the cache sets. It is a tricky problem, since each procedure's position influences both the miss count and the good positions of the other procedures, and looking for the optimum placement is an NP-complete problem. A program with P areas, running on a system whose cache has S sets, generates a search space of S^P possible mappings, so a brute force attack on the optimum problem is typically infeasible: for example, a program with 100 areas has 64^100 (about 4.15 · 10^180) possible mappings on a cache with 64 sets. As a consequence, only non-optimal solutions can be found in practice, and the problem is to obtain a very good quasi optimum solution.
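To make the efficiency definition above concrete, the following sketch (Python; a simplified illustration with assumed names, not the measurement code used in the paper) computes the misses of a Belady cache over an address trace and derives e as the smallest optimal-cache size whose misses do not exceed those of the real cache:

    def belady_misses(trace, n_blocks):
        # Belady's optimal policy [3]: on a miss with a full cache, evict the
        # resident block whose next use lies farthest in the future.
        next_use, last = [0] * len(trace), {}
        for i in range(len(trace) - 1, -1, -1):   # precompute next-use positions
            next_use[i] = last.get(trace[i], float('inf'))
            last[trace[i]] = i
        cache, misses = {}, 0                     # block -> position of next use
        for i, blk in enumerate(trace):
            if blk not in cache:
                misses += 1
                if len(cache) >= n_blocks:
                    del cache[max(cache, key=cache.get)]   # farthest next use
            cache[blk] = next_use[i]
        return misses

    def efficiency(trace, real_misses, S):
        # Smallest e (%) such that a Belady cache with (e/100)*S blocks
        # produces no more misses than the real S-block cache.
        for n in range(1, S + 1):
            if belady_misses(trace, n) <= real_misses:
                return 100.0 * n / S
        return 100.0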

Figure 1: Efficiency of the mp3_enc application before and after optimization, for a direct access I-cache from 1 Kbyte to 8 Kbytes. The cache has poor efficiency with the original layout, and increasing the cache size does not guarantee high cache exploitation. Layout optimization allows a more effective exploitation of the available cache capacity, leaving a negligible improvement margin.

Table 1: Original (O) and after-placement (P) efficiency (%) for various applications, for a direct access cache with 32-byte block size. The cache size varies from 1 Kbyte to 8 Kbytes.

Application   1 KB (O/P)   2 KB (O/P)   4 KB (O/P)   8 KB (O/P)
Lisp          34 / 46      31 / 68      24 / 73      19 / 68
Compress      56 / 81      46 / 62      30 / 79      34 / 76
Cjpeg         59 / 78      32 / 90      17 / 90      38 / 91
Empeg         46 / 71      32 / 84      20 / 75      10 / 75
Mp3_enc       40 / 78      28 / 76      15 / 91      18 / 97
Gps           43 / 71      29 / 59      23 / 71      30 / 69
Mancala       34 / 81      42 / 90      32 / 75      16 / 37


3 Related Works

We present a short summary of software and hardware techniques for improving cache performance. Both hardware and software techniques have been proposed for improving program-cache compatibility: the former [6-12] improve cache flexibility through special cache architectures; the latter [13-23] change the features of the memory access sequences through various program modifications, aiming at better cache performance. Among the hardware proposals: split caches are employed for references with different locality features [6, 7]; trace caches [8] store traces of the dynamic instruction stream to increase fetch bandwidth; the victim cache [9] was proposed to reduce the miss penalties due to replacement algorithm faults and set overloading; random [10] and XOR-based [11] cache mappings are considered for eliminating conflicts due to regular access patterns; Phalke et al. [12] propose a sophisticated replacement algorithm for better block reuse prediction. However, very flexible hardware techniques are hard to implement.

Software techniques change the program before or during compilation in order to improve cache usage; typically, loops and/or data are taken into consideration. Kandemir et al. [13] present joint loop and data transformations; loop tiling and data alignment are proposed by Panda et al. [14]. Other techniques focus on procedure merging [15], while Mendlson et al. [16] propose conflict miss avoidance through code replication and proper loop-code allocation into the cache.

Other software approaches rely on program layout optimization through repositioning of compiled code and/or data elements. Chilimbi et al. [17] propose a run-time technique to improve the data cache locality of pointer-manipulating programs. Techniques that rearrange compiled code can perform basic block reordering [20-22], procedure re-mapping [18-22] and procedure splitting [20, 22]. Basic block reordering can produce longer jumpless instruction sequences, thus increasing prefetch effectiveness. Basic block placement is usually done by counting the switch frequency between basic blocks and then forming chains of basic blocks that tend to execute in sequence, within the same procedure of the application code [21, 22] or even across different procedures of operating system code [20]. Basic block reordering usually gives good results, but it implies working on the assembly code of the target machine. Procedure re-mapping aims to place procedures so that few cache conflicts are generated at run time. Typically, the interaction of procedures is modeled through the call frequency between caller-callee pairs [21, 22]; however, this kind of parameter can only detect procedure interaction, and carries no indication of how to achieve good placements. As a consequence, these approaches, except for Kalamatianos et al. [19], use a mechanical placement strategy which maps potentially conflicting procedures next to each other in the memory space, trying to map them onto the same cache page. Basic block reordering allows splitting procedures into executed and non-executed parts, so that placement can deal only with the former, increasing mapping effectiveness. However, the cache modulo effect is not managed during procedure placement, so if the set of highly interacting procedures exceeds the cache page size, the improved layout can generate even more misses than the original [18]. Moreover, only caller-callee interference is considered: conflicts among procedures which do not call each other directly are not precisely evaluated, even though such conflicts can drive high conflict miss rates. Kalamatianos et al. [19] show how a more precise modeling of procedure conflicts can drive better results: their CMG method gathers conflict information also among procedures which do not call each other directly, and the conflict detection unit is the cache block instead of the whole procedure. Finite cache size is taken into account as well, and a cache line coloring algorithm is adopted for placement. The authors also present their previous method (CGO), which weights procedure interference only with switch frequencies; CMG outperforms CGO thanks to its more accurate estimate and modeling of conflicts. We are convinced that a very precise characterization of program behavior can produce significant results. In previous work [23], we presented a fast placement technique which used trace analysis to collect information on program behavior; procedures were then placed according to a heuristic function.

4 Proposed technique

Our proposal obtains a near optimum placement for program procedures and data sections (areas). We use a Branch and Bound search algorithm to look for an optimum placement; however, the algorithm stops, in a predictable time, when the first feasible solution is reached. As the conflict miss phenomenon is quite complex, it is hard to estimate the misses of each layout analytically. For this reason, we simulate the program execution during placement, so that every kind of interference among procedures can be considered. The program execution is modeled through its trace of memory accesses, as the trace is exactly the program behavior the cache deals with. The trace of program addresses allows precise conflict miss control during placement: the cache modulo effect of long areas, and the interference between areas which do not call each other directly, are automatically taken into account. Moreover, misses due to the cache replacement policy can easily be considered. Another important feature of our approach is that placement decisions are taken during placement itself: simulation lets the algorithm control the solution search step by step, instead of employing a mechanical placement scheme driven by profile information.

Simulation time is the cost of the method's features and effectiveness. For this reason, we tuned a lossless trace compaction technique to reduce the processing time: a modified Puzak [26] scheme filters out only the references which cannot modify the cache status, whatever the procedure placement. The average trace compaction for the considered applications is about 90% for I-caches and about 50% for unified caches; the trace length reductions result in a similar processing time decrease. In addition, computational time can be reduced by observing that only the most critical areas (e.g., the first 10-15 areas in the examined applications) need to be placed accurately, because they are responsible for the overall cache performance: low criticality areas usually produce few misses, which remain almost the same whatever the area placement, so they can easily be placed through a simple heuristic. Moreover, not the whole program needs to be optimized because, typically, only some of the system functionalities are heavily used and determine the overall performance.

Our technique works in three phases: procedure ordering, mapping search, and memory allocation.
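As an illustration of the trace-driven cost evaluation, here is a minimal sketch (Python; the (area, block) trace representation and all names are our illustrative assumptions, not the paper's code) of the miss count of a candidate layout on a direct access cache, together with a conservative compaction filter in the spirit of the modified Puzak scheme [26]:

    def layout_misses(trace, offsets, n_sets=64):
        # `trace` is a sequence of (area, block) references; `offsets[area]`
        # is the cache offset (in blocks) the placement assigns to the area.
        resident, misses = [None] * n_sets, 0
        for area, blk in trace:
            s = (offsets[area] + blk) % n_sets    # set index after placement
            if resident[s] != (area, blk):        # cold or conflict miss
                resident[s] = (area, blk)
                misses += 1
        return misses

    def compact_trace(trace):
        # Drop a reference that touches the same block of the same area as the
        # preceding one: it hits whatever the placement, so removing it cannot
        # change the miss count of any candidate layout.
        return [r for i, r in enumerate(trace) if i == 0 or r != trace[i - 1]]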

4.1 Procedure ordering

Our placement strategy lays down one area at a time and looks for its best position given the already placed areas. This strategy is very effective if the most critical areas are placed first, so that their placement is not affected by less critical areas. An area is regarded as critical, with respect to a set of given areas, if only a few of its placements produce low miss rates while many of its placements produce high miss rates. The ordering phase sorts areas according to our proposed estimate of criticality. Simple estimates fail to capture the complexity of the problem: criticality must take into account both the spatial and the temporal locality of references. In addition, a correct criticality measure cannot depend on any particular placement, such as the original one. The ordering phase starts by evaluating the criticality of each pair of areas, so that the most critical pair comes first in the order. It then evaluates each remaining area's criticality with respect to the areas already selected, and selects the most critical one; the phase ends when all areas have been considered. We define the placement criticality of an area b with respect to the set A of already placed areas through the following empirical expression:

Criticality(A, b) = ShapeFactor(A, b) · MissCapability(A, b)

where MissCapability(A, b) is the maximum increase in misses obtained by adding the area b to the already placed area set A; this parameter captures the temporal interaction among procedures. ShapeFactor(A, b) measures the difficulty of finding good positions for the area b with respect to the area set A. For this purpose, we represent the spatial access distribution of an area as the distribution over the cache space of all its references, excluding those that hit whatever the program layout. The parameter has a high value if the access distribution of area b is not compatible with those of the areas in A; it spans from 0 to 1 and captures the spatial interaction among procedures. In this way, both temporal and spatial access features are considered, and an area can therefore be equally critical because of either its temporal or its spatial interaction with the other areas.
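A sketch of the resulting greedy ordering (Python; `criticality(A, b)` stands for the empirical expression above and is assumed to be provided, since its implementation is not given in code form in the paper):

    def order_areas(areas, criticality):
        # Select the most critical pair first, then repeatedly append the area
        # that is most critical with respect to those already selected.
        a0, b0 = max(((a, b) for a in areas for b in areas if a != b),
                     key=lambda p: criticality({p[0]}, p[1]))
        order = [a0, b0]
        remaining = [x for x in areas if x not in order]
        while remaining:
            nxt = max(remaining, key=lambda b: criticality(set(order), b))
            order.append(nxt)
            remaining.remove(nxt)
        return order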

4.2 Mapping search

In this phase, the algorithm looks for the best position of each area within the cache address space, following the order obtained in the previous phase. The search starts by considering only the first two areas as input to the cache, and determines the relative position that gives the minimum number of misses. It then considers one more area in each step and looks for its best position relative to the partial layout given by the previously placed areas. This is done by evaluating the misses of the already placed areas together with the new one in each of its possible cache positions, and selecting the best one. The search algorithm applies a branch and bound technique that uses, as cost function, the number of misses of the areas considered in each step.¹

¹ A correct cost function must be non-decreasing with the number of placed areas. This property holds if we operate on direct access caches or on associative caches with LRU replacement policy; it is false, for instance, for associative caches with a FIFO replacement scheme.
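The search can be pictured with the following best-first sketch (Python; it assumes a `misses(placement)` helper such as the simulator sketched earlier, and omits the bounding refinements and influence estimators discussed below). Because the cost is non-decreasing with the number of placed areas, the first complete placement popped from the queue is taken as the solution:

    import heapq, itertools

    def branch_and_bound_place(areas, cache_sets, misses):
        # `areas` is already sorted by decreasing criticality.
        tie = itertools.count()                   # tiebreaker for equal costs
        heap = [(0, 0, next(tie), {})]            # (cost, depth, tie, placement)
        while heap:
            cost, depth, _, placement = heapq.heappop(heap)
            if depth == len(areas):               # first feasible solution: stop
                return placement
            area = areas[depth]
            for offset in range(cache_sets):      # branch on every cache offset
                child = {**placement, area: offset}
                heapq.heappush(heap, (misses(child), depth + 1, next(tie), child))
        return None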

The algorithm places one area at a time in its best position; however, a very good placement of the first areas can make it impossible to place the subsequent areas well. The real problem is therefore to place an area in a good position while also taking into account the areas that will be placed later; this is difficult because the influence of subsequent areas has to be considered without any assumption on their position. In addition, to be useful in driving placement, the influence of the following areas must be a function of the position of the area being placed in the current step. We have put considerable effort into estimating such influence, because smart estimators help the algorithm avoid apparently good search branches that lead to bad placements, and highlight good search directions that could appear bad. For this reason, good estimators help achieve near optimum first feasible solutions.

Table 2: Original (O) and after-placement (P) miss rates (%) for various applications, for a direct access cache with 32-byte block size; the cache size varies from 1 Kbyte to 8 Kbytes.

Application   1 KB (O/P)    2 KB (O/P)     4 KB (O/P)    8 KB (O/P)
Lisp          8,16 / 6,59   5,31 / 2,51    3,74 / 0,65   1,98 / 0,02
Compress      1,68 / 0,50   0,28 / 0,19    0,20 / 0,08   0,10 / 0,01
Cjpeg         1,43 / 0,60   1,18 / 0,23    0,97 / 0,04   0,07 / 0,02
Empeg         8,16 / 5,24   6,08 / 0,24    4,28 / 0,01   3,92 / 0,01
Mp3_enc       9,16 / 4,38   6,55 / 1,57    5,84 / 0,42   1,66 / 0,01
Gps           9,35 / 5,90   7,12 / 2,72    3,82 / 0,71   1,08 / 0,21
Mancala       4,43 / 2,17   2,09 / 0,417   0,99 / 0,01   0,99 / 0,01

4.3 Memory allocation

The search phase outputs the cache offset of each area involved in the search. Placing the areas at memory positions that satisfy a particular cache offset may lead to memory fragmentation. However, an area with a specific cache offset can occupy K = MemorySize / CachePageSize positions in main memory; therefore, the role of this phase is to determine the memory mapping of each area so that the cache mappings are preserved and memory holes are minimized. The problem is simplified by the presence of areas that can have any cache placement and can be used to fill the memory holes: these are the non-critical areas and the program sections that do not need optimization. There is some flexibility in filling memory holes; in fact, holes can vary their width, by multiples of the cache page size, to better accommodate the free areas. In this way, memory allocation can easily be managed through a best fit scheme, modified for variable-size holes.
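One possible reading of this modified best fit, as a sketch (Python; the hole representation and all names are illustrative assumptions, and the hole-width flexibility described above is not modeled):

    def place_in_memory(holes, size, cache_offset, page=2048):
        # Best fit over free memory holes (base, length), constrained so the
        # area starts at an address congruent to `cache_offset` modulo the
        # cache page size.
        best, best_slack = None, None
        for i, (base, length) in enumerate(holes):
            start = base + (cache_offset - base) % page   # first admissible address
            slack = length - (start - base) - size        # space left over
            if slack >= 0 and (best_slack is None or slack < best_slack):
                best, best_slack = (i, start), slack
        return best    # (hole index, start address), or None if nothing fits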

5 Results

We have considered a set of applications typically used in embedded systems: an mp3 encoder (mp3_enc), mpeg and jpeg compressors (empeg, cjpeg), a gps algorithm and a strategy game (mancala). The mancala optimization has been done for a unified cache, so data sections too were considered for placement. Moreover, two SpecInt95 benchmarks (lisp and compress) are used for comparison with other techniques. All applications have been compiled with the VLSI Technology JumpStart development toolkit for the ARM7 processor. Traces are generated using the integrated debugger; references to the I/O space and auxiliary system calls are not traced because they would not be cached or present in the target system. The trace length varies from 2,3 million references for gps to 53 million references for empeg.

Table 3: Original (O) and after-placement (P) miss rates (%) for various applications, for a 2-Kbyte cache with 32-byte block size, as a function of associativity.

Application   1 way (O/P)   2 ways (O/P)   4 ways (O/P)   Fully assoc. (O/P)
Lisp          5,31 / 2,51   3,64 / 2,06    3,10 / 2,09    2,73 / -
Compress      0,28 / 0,20   0,26 / 0,20    0,22 / 0,21    0,24 / -
Cjpeg         1,18 / 0,23   0,39 / 0,26    0,37 / 0,31    0,48 / -
Empeg         6,08 / 0,24   3,44 / 0,60    1,29 / 0,49    0,27 / -
Mp3_enc       6,6 / 1,57    3,2 / 1,61     2,3 / 1,70     2,0 / -
Gps           7,12 / 2,72   4,20 / 2,72    3,74 / 2,91    2,92 / -
Mancala       2,09 / 0,42   1,87 / 0,52    1,64 / 0,57    1,11 / -

Figure 2: Miss rate for the original and optimized mp3_enc and empeg applications, for a direct access I-cache with 32-byte block size, versus cache size. The gray area sets a lower bound on the miss rate achievable through program area mapping. For cache sizes ranging from 1 Kbyte to 8 Kbytes, the optimized application performs better than the original one on a 4 times larger cache.

Figure 2 shows the miss rates for the original and optimized mp3_enc and empeg applications for a direct access cache as a function of cache size. The miss rate reduction is significant for every cache size. The smallest reduction for mp3_enc (about 50%) is obtained on the 1-Kbyte cache; even so, reducing the high original miss rate from 9,2% to 4,4% can drive a noticeable reduction of the average memory latency. Moreover, such a reduction makes the optimized application on the 1-Kbyte cache perform better (4,38%) than the original one on the 4-Kbyte cache (5,84%). Also, for the 2-Kbyte cache, the optimized miss rate (1,57%) is better than the original 8-Kbyte miss rate (1,66%). The optimized empeg on a 1-Kbyte cache is better than the original on a 2-Kbyte cache, and on a 2-Kbyte cache its miss rate is negligible. Conversely, the miss rate of the original empeg layout drops only for a 32-Kbyte cache, which solves the program's conflicts through a capacity increase. Table 2 confirms these results for all the examined applications; the only exceptions are lisp on the 1-Kbyte cache, where the reduction from 8,16% is not enough to reach the 5,31% miss rate of the 2-Kbyte cache, and mancala on the 1-Kbyte cache, whose optimized 2,17% miss rate does not reach the 2,09% of the original layout on the 2-Kbyte cache.

Figure 3: Miss rate for the original and optimized mp3_enc and empeg applications, for a 2-Kbyte I-cache with 32-byte block size, versus the number of ways. The gray area sets a lower bound on the miss rate achievable through program area mapping. The optimized application on the direct access cache performs better than the original one whatever the cache associativity.

Usually, set associative caches are adopted to increase cache flexibility: the approach is based on the hypothesis that memory references exhibit temporal locality that the LRU policy can exploit effectively. In Figure 3 we analyze the optimizer performance with respect to an associativity increase. We can see that an accurate program area mapping achieves better results than increasing cache associativity, and that the optimized applications have a very stable behavior across associativity changes; this fact dramatically reduces the benefit of employing associative caches. The slight increase in the optimized miss rate when associativity grows is due to two phenomena: first, cache conflicts are increasingly managed by the replacement policy and so are harder to control through placement; second, fewer sets limit the algorithm's freedom in placing program areas. Table 3 confirms these observations for all applications on a 2-Kbyte cache; there are only two exceptions, lisp on a 2-way cache and empeg on a 4-way cache, where the optimized layout benefits from the associativity increase. The optimizer cannot work on a fully associative cache because there is only one set, and each program area is trivially mapped onto it.

Acknowledgements

We would like to acknowledge the contributions of Umberto Caselli to this work.

References

[1] R.A. Sugumar and S.G. Abraham, "Efficient Simulation of Caches under Optimal Replacement with Applications to Miss Characterization", ACM SIGMETRICS Performance Evaluation Review, vol. 21, no. 1, June 1993, pp. 24-35.
[2] M.D. Hill and A.J. Smith, "Evaluating Associativity in CPU Caches", IEEE Transactions on Computers, vol. 38, no. 12, December 1989, pp. 1612-1630.
[3] L.A. Belady, "A study of replacement algorithms for a virtual-storage computer", IBM Systems Journal, vol. 5, no. 2, 1966, pp. 78-101.
[4] A.J. Smith, "Cache Memories", ACM Computing Surveys, vol. 14, no. 3, September 1982, pp. 473-530.
[5] C.A. Prete, M. Graziano and F. Lazzarini, "The ChARM tool for tuning embedded systems", IEEE Micro, vol. 17, no. 4, July/August 1997, pp. 67-75.
[6] V. Milutinovic, B. Markovic, M. Tomasevic and M. Tremblay, "The Split Temporal/Spatial Cache", Proceedings of SCIzzL-5, Santa Clara, CA, USA, March 1996, pp. 63-69.
[7] A. González, C. Aliagas and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality", Proceedings of ACM ICS '95, Barcelona, Spain, July 1995, pp. 338-347.
[8] E. Rotenberg, S. Bennett and J.E. Smith, "A Trace Cache Microarchitecture and Evaluation", IEEE Transactions on Computers, Special Issue on Cache Memory, vol. 48, no. 2, February 1999, pp. 111-120.
[9] N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers", Proceedings of the 17th ISCA, Seattle, WA, USA, June 1990, pp. 364-373.
[10] N. Topham and A. González, "Randomized Cache Placement for Eliminating Conflicts", IEEE Transactions on Computers, vol. 48, no. 2, February 1999, pp. 185-192.
[11] A. González, M. Valero, N. Topham and J. Parcerisa, "Eliminating Cache Conflict Misses Through XOR-Based Placement Functions", Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997, pp. 76-83.
[12] V. Phalke and B. Gopinath, "Compression-Based Program Characterization for Improving Cache Memory Performance", IEEE Transactions on Computers, vol. 46, no. 11, November 1997, pp. 1174-1186.
[13] M. Kandemir, J. Ramanujam and A. Choudhary, "Improving Cache Locality by a Combination of Loop and Data Transformations", IEEE Transactions on Computers, vol. 48, no. 2, February 1999, pp. 159-167.
[14] P. Panda, H. Nakamura, N. Dutt and A. Nicolau, "Augmenting Loop Tiling with Data Alignment for Improved Cache Performance", IEEE Transactions on Computers, vol. 48, no. 2, February 1999, pp. 142-149.
[15] S. McFarling, "Procedure Merging with Instruction Caches", Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 1991, pp. 71-79.
[16] A. Mendlson, S. Pinter and R. Shtokhamer, "Compile Time Instruction Cache Optimizations", Computer Architecture News, vol. 22, no. 1, March 1994, pp. 44-51.
[17] T.M. Chilimbi, M.D. Hill and J.R. Larus, "Cache-Conscious Structure Layout", Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, USA, May 1999, pp. 1-12.
[18] D. Scales, "Efficient Dynamic Procedure Placement", WRL Research Report 98/5, August 1998.
[19] J. Kalamatianos, A. Khalafi, D. Kaeli and W. Meleis, "Analysis of Temporal-Based Program Behavior for Improved Instruction Cache Performance", IEEE Transactions on Computers, vol. 48, no. 2, February 1999, pp. 168-175.
[20] J. Torrellas and R. Daigle, "Optimizing the Instruction Cache Performance of the Operating System", IEEE Transactions on Computers, vol. 47, no. 12, December 1998, pp. 1363-1381.
[21] W.W. Hwu and P.P. Chang, "Achieving High Instruction Cache Performance with an Optimizing Compiler", Proceedings of the 16th Annual ISCA, ACM SIGARCH Computer Architecture News, vol. 17, no. 3, June 1989, pp. 242-251.
[22] K. Pettis and R.C. Hansen, "Profile Guided Code Positioning", Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, June 1990, pp. 16-27.
[23] S. Lorenzini, G. Luculli and C.A. Prete, "A Fast Placement Algorithm for Optimal Cache Use", MELECON '98, Proceedings of the 9th Mediterranean Electrotechnical Conference, Tel-Aviv, Israel, May 1998, pp. 1279-1283.
[24] A. Agarwal, "Analysis of Cache Performance for Operating Systems and Multiprogramming", Kluwer Academic Publishers, 1989.
[25] A.J. Smith, "Two Methods for the Efficient Analysis of Memory Address Trace Data", IEEE Transactions on Software Engineering, vol. SE-3, no. 1, January 1977, pp. 94-101.
[26] T.R. Puzak, "Analysis of cache replacement algorithms", PhD thesis, University of Massachusetts, Department of Electrical and Computer Engineering, February 1985.