Improving the YAGS Branch Predictor

Hans Vandierendonck

ELIS Technical Report 2005-003, August 2005

VAKGROEP ELEKTRONICA EN INFORMATIESYSTEMEN, St.-Pietersnieuwstraat 41, B-9000 Gent

Contents

1 Introduction
2 The YAGS Branch Predictor
3 Methodology
4 Evaluation
4.1 The Unified Direction Prediction Cache
4.2 Tagging Schemes
4.3 Analysis over Wide Bit Budget Range
5 Related Work
6 Conclusion

Abstract

This paper presents an analysis of the YAGS branch predictor. YAGS is a bimodal backing predictor that is overridden by a direction prediction cache, a tagged branch predictor indexed with branch address and global branch history. This paper presents several improvements to the YAGS scheme. First, we unify the taken and not-taken direction prediction caches, which effectively increases branch history correlation. Second, we show that associativity in the cache is beneficial. Previous work stated that it is best to leave the cache direct mapped. This paper shows that higher associativity improves predictor accuracy by 10%, provided that global branch history information is added to the tags.

1 Introduction

The YAGS (Yet Another Global Scheme) branch predictor [EM98] is a global history branch predictor designed to counter interference. Roughly speaking, it consists of a cache of branch predictions for certain pairs of branch history and branch address. The prediction is taken from the cache when it contains information on the current history and branch address. Otherwise, the prediction is taken from the backing bimodal predictor.

The aim of this paper is to provide a deeper understanding of the YAGS predictor and, building on the presented insights, improve its accuracy. This paper shows that the accuracy of YAGS depends in a non-trivial way on its parameters. For example, Eden and Mudge [EM98] found associativity in the cache of little use. On the one hand, accuracy increases with increasing associativity as there is less aliasing. On the other hand, the history length becomes shorter, resulting in poorer correlation with previous branch instances. The net effect is that associativity is not really worthwhile. We show that this conclusion is not always valid. In fact, associativity can be really worthwhile, but this depends on (i) the length of the tag and (ii) the information stored in the tag, i.e., branch address, global history or a combination of both.

This paper also proposes to unify the caches of YAGS. YAGS, in fact, has two caches. When the backing predictor predicts taken, YAGS looks in the 'not-taken' cache, and when the backing predictor predicts not-taken, YAGS looks in the 'taken' cache. We show that all information may be held in a single unified cache. This modification improves accuracy as it exposes additional branch history correlation. Furthermore, the unified YAGS is less sensitive to associativity.

The remainder of this paper is structured as follows. The YAGS branch predictor is discussed in detail in Section 2. Section 3 describes the methodology. The dependence of the accuracy of YAGS on its parameters is analyzed using simulation in Section 4, and a more accurate YAGS is proposed. Related work is discussed in Section 5. Section 6 concludes the paper.

2 The YAGS Branch Predictor

The YAGS branch predictor [EM98] was devised to increase prediction accuracy by removing aliasing. To this end, it combines ideas from previously known branch prediction schemes, such as the gshare [McF93], the gskew [MSU97] and the bi-mode [LCM97] schemes. The bi-mode scheme removes aliasing by dynamically assigning branches to one of two classes: biased towards taken and biased towards not-taken. Depending on this class, the prediction is then drawn from either the taken direction table or the not-taken direction table. Both of these tables are equally sized gshare predictors. YAGS improves on this scheme by using tags and associativity to remove aliasing in the direction prediction tables.

[Figure 1 diagrams: (a) Original YAGS scheme; (b) YAGS with unified direction cache. Block labels: history, address, choice PHT, T cache, NT cache, backing predictor, cache, tag, 2bc, prediction.]

Figure 1: YAGS predictor organization: original scheme (left) and YAGS scheme with a unified direction cache (right).

The operation of YAGS is detailed in Figure 1(a). A branch is looked up in the bimodal choice PHT, which effectively predicts the branch bias. When the choice PHT predicts a taken direction, the same branch is looked up in the not-taken cache to see whether the current instance of that branch disagrees with the branch bias. Global branch history is used in addition to the branch address to identify a disagreeing branch, as a branch typically disagrees only in particular situations, i.e., for certain branch histories. When there is a hit in the not-taken cache, the current instance of the branch may disagree and the prediction is taken from the not-taken cache. Similarly, when the choice PHT predicts not-taken, the branch is looked up in the taken cache to see if it makes an exception to the not-taken bias.

After a branch is executed, not all prediction tables are updated with the branch information. The taken cache is updated only if it was used to make a prediction or when the branch outcome is taken and the choice PHT predicted not-taken. Similarly, the not-taken cache is updated if it was used to make a prediction or when the branch outcome is not-taken and the choice PHT mispredicted. The choice PHT is updated when it was used, when it predicted correctly, or when the direction caches predicted wrongly.

To avoid aliasing in the direction prediction caches, it was proposed to introduce associativity [EM98]. The replacement policy used is a modified LRU scheme, where the least-recently-used branch is evicted first, except when there is a branch whose 2-bit saturating counter is biased opposite to the direction of the cache, i.e., not-taken in the taken cache or taken in the not-taken cache. A pure LRU scheme is used throughout this paper, as we found that this small modification has little impact on prediction accuracy.

The separation of the direction prediction cache into a taken and a not-taken cache appears artificial and may constrain prediction accuracy because of limited capacity in either the taken cache or the not-taken cache. In particular, the presence of a 2-bit saturating counter in each entry of the taken and not-taken caches makes this a non-obvious design choice. Therefore, we present a modified YAGS scheme with a unified direction cache (Figure 1(b)). To make a prediction, the direction cache and the choice PHT (which we now call a backing predictor) are accessed simultaneously. If there is a hit in the direction cache, then the direction cache delivers the prediction. Otherwise, the prediction is drawn from the backing predictor.

The direction cache is updated using the same rules as the original YAGS: it is updated when the prediction is drawn from the direction cache or when the backing predictor mispredicted. The backing predictor itself does not have to store all information. In fact, it only needs to be updated when the prediction was drawn from it, when it predicted correctly or when the direction cache predicted wrongly. The organization of Figure 1(b) is the closest possible variant of YAGS where the prediction tables are unified.

Note that the original YAGS scheme is obtained when the index of the direction cache is restructured in such a way that one index bit equals the prediction of the backing predictor. Then, the unified cache is logically split in two direction caches.
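The prediction and update rules of the unified scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator used in this report: the class and method names are hypothetical, the cache is direct mapped with branch address tags, the index is a plain gshare hash with the most recent outcome in history bit 0, and a newly allocated entry starts with a weak counter in the direction of the outcome (the text does not specify the initial counter value).

```python
def bump(counter, taken):
    """Saturating update of a 2-bit counter (values 0..3)."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class UnifiedYags:
    """Direct-mapped unified YAGS sketch (hypothetical names and defaults)."""

    def __init__(self, backing_bits=14, index_bits=12, tag_bits=6):
        self.backing_mask = (1 << backing_bits) - 1
        self.index_mask = (1 << index_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1
        self.backing = [1] * (1 << backing_bits)   # bimodal 2-bit counters
        self.tags = [None] * (1 << index_bits)     # None marks an empty entry
        self.counters = [1] * (1 << index_bits)    # 2-bit counters in the cache
        self.history = 0                           # bit 0 = most recent outcome

    def _lookup(self, pc):
        idx = (pc ^ self.history) & self.index_mask      # gshare-style index
        hit = self.tags[idx] == (pc & self.tag_mask)     # branch address tag
        cache_pred = self.counters[idx] >= 2
        backing_pred = self.backing[pc & self.backing_mask] >= 2
        return idx, hit, cache_pred, backing_pred

    def predict(self, pc):
        _, hit, cache_pred, backing_pred = self._lookup(pc)
        return cache_pred if hit else backing_pred

    def update(self, pc, taken):
        """Train on the resolved outcome; returns the prediction that was made."""
        idx, hit, cache_pred, backing_pred = self._lookup(pc)
        prediction = cache_pred if hit else backing_pred
        if hit:
            # The prediction was drawn from the direction cache.
            self.counters[idx] = bump(self.counters[idx], taken)
        elif backing_pred != taken:
            # The backing predictor mispredicted: allocate an entry (assumed init).
            self.tags[idx] = pc & self.tag_mask
            self.counters[idx] = 2 if taken else 1
        # Backing predictor: updated when it was used, when it was correct,
        # or when the direction cache predicted wrongly.
        if (not hit) or backing_pred == taken or cache_pred != taken:
            b = pc & self.backing_mask
            self.backing[b] = bump(self.backing[b], taken)
        self.history = ((self.history << 1) | int(taken)) & self.index_mask
        return prediction
```

For a branch that strictly alternates between taken and not-taken, a bimodal predictor alone cannot settle, whereas in this sketch the two recurring histories are either cached or correctly covered by the backing predictor once the history register has filled.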

3 Methodology

The branch predictors in this paper are evaluated by means of trace-based simulation of SPEC2000 integer benchmarks. The benchmarks are compiled on an Alpha system using the Compaq compilers with optimization flags set to -arch ev6 -fast -O4 for C programs and to -arch ev6 -O2 for C++ programs. Traces of 50 million branches are obtained using SimPoint (http://www.cs.ucsd.edu/~calder/SimPoint/), except for vortex, where the trace designated by SimPoint has a branch misprediction rate of less than 1% with a bimodal predictor. A suitable trace for vortex was manually identified.

Table 1: Summary information about the benchmarks, showing the number of instructions skipped before simulation (column 'skip'), the number of instructions simulated (column 'insn'), the number of conditional branches in the trace (column 'cbr') and the number of all branches in the trace (column 'br').

Name     input               skip       insn    cbr     br
bzip2    program             6.0G       50M     49.0M   62.6M
crafty   reference           74.5G      50M     40.7M   54.0M
eon      kajiya (ref)        15.9G      50M     22.0M   46.6M
gap      reference           20.1G      50M     44.0M   66.7M
gcc      166.i               13.9G      50M     73.5M   75.4M
gzip     graphic             54.6G      50M     33.2M   48.2M
mcf      reference           31.5G      50M     88.0M   115.8M
parser   reference           533.3G     50M     54.6M   76.5M
perl     scrabble (SPEC95)   3.7G       50M     44.1M   67.4M
twolf    reference           233.2G     50M     50.9M   62.0M
vortex   reference no. 3     150M(br)   60.2M   50.0M   80.2M

4 Evaluation

Improvements to the YAGS scheme are proposed one by one based on insights gathered from simulation results. The baseline scheme is the original YAGS scheme with a choice PHT of 16384 2-bit saturating counters. Each direction prediction cache has 2048 entries, each containing a 6-bit tag and a 2-bit saturating counter. The caches are direct mapped. This totals about 64 Kibits of storage.

Most of the graphs in this paper are similar in construction to Figure 2. Here we measure the impact of associativity in the base YAGS scheme. Each curve corresponds to a fixed degree of associativity. For each degree of associativity, we vary the number of tag bits from 2 to 18 in increments of 2. Changing the tag length also changes the size of the predictor, so the horizontal axis shows the predictor size. The performance measure is mispredicts per kilo-instructions (MPkI): the number of mispredicted conditional branches divided by the total number of branches in the trace. Results are averaged over all traces.

[Figure 2 plot: MPkI versus size (KiB); one curve each for 1-way, 2-way, 4-way and 8-way direction caches.]

Figure 2: Impact of associativity in the base YAGS scheme.

Figure 2 shows that associativity in the baseline YAGS scheme is of little importance. Clearly, adding LRU bits costs more than what can be gained, and increasing associativity beyond 2-way set-associativity decreases accuracy. The loss of accuracy is due to the shortening of the global history length and the loss of correlation: when doubling associativity, the number of sets is halved to maintain the same predictor size, so the index in the direction caches is one bit shorter [EM98]. We will show in this paper that associativity does have a positive impact when certain enabling modifications to the baseline YAGS scheme are made. The impact of associativity also varies with the tag length, which is why the tag length is varied in all graphs.
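As a worked check of the storage figures, the following sketch computes the predictor size and the index width for a given configuration. The function names are hypothetical. The LRU cost of n(n-1)/2 bits per n-way set is an assumption, since the report does not state the LRU encoding; under that assumption the sketch reproduces the 64 Kibit baseline as well as the 2 Kibit and 14 Kibit size increases quoted in Section 4.2.

```python
def lru_bits_per_set(ways):
    """Assumed LRU cost: one bit per pair of ways, i.e. n*(n-1)/2 bits per set."""
    return ways * (ways - 1) // 2


def yags_size_bits(backing_entries, cache_entries, tag_bits, ways=1):
    """Total storage in bits: 2-bit backing counters, tagged cache entries, LRU state."""
    sets = cache_entries // ways
    backing = backing_entries * 2
    cache = cache_entries * (tag_bits + 2)      # tag plus 2-bit counter per entry
    return backing + cache + sets * lru_bits_per_set(ways)


def index_bits(cache_entries, ways):
    """Number of index bits for a power-of-two number of sets."""
    return (cache_entries // ways).bit_length() - 1


KIBIT = 1024
baseline = yags_size_bits(16384, 2 * 2048, tag_bits=6)   # two 2048-entry caches
print(baseline // KIBIT)                                  # 64 (Kibits)
print(index_bits(2048, 1), index_bits(2048, 2))           # 11 10
```

Doubling the associativity at constant capacity removes one index bit, which in a gshare-style index also removes one bit of global history.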

4.1 The Unified Direction Prediction Cache

First, we analyze the benefit of using a unified direction prediction cache. Figure 3 shows that the unified prediction cache (label 'unified') performs a little better than the base YAGS scheme (label 'base'), up to 1.5% for large tags. The unified prediction cache breaks down for 2-bit tags, but the baseline YAGS also performs poorly with such short tags.

[Figure 3 plot: MPkI versus size (KiB); curves for base, unified, unified+hist len, unified+BT and unified+BT+hist len.]

Figure 3: Comparison between split T/NT direction tables and a unified direction cache for direct mapped tag caches.

By unifying the direction prediction caches, an extra index bit has become available. In the 'unified' case, this bit is filled with the least significant PC bit. However, this bit could also be hashed with a global branch history bit, provided the global branch history length is increased by 1. When doing so, correlation with previous branch outcomes is increased and the unified cache is now 3.1% more accurate.

The branch direction may potentially depend on the prediction of the choice PHT or the backing predictor. If this were true, it would favour the base YAGS. To check this hypothesis, we evaluate a scheme with a unified prediction cache where the bias predicted by the backing predictor is added as a tag bit to each entry in the direction cache. To make room for it, the PC tag is shortened by 1 bit. Adding this tag bit decreases accuracy (label 'unified+BT'), although lengthening the global branch history to use more correlation (label 'unified+BT+hist len') can restore accuracy to about the same level as the base YAGS. This shows that the ability to store a different prediction depending on the predicted bias is not an important property of YAGS.

Increasing associativity has a strongly negative effect in the base YAGS. This effect, however, disappears with a unified cache. Figure 4 shows the accuracy of the base and unified schemes (with increased history length) for various degrees of associativity. In the unified scheme, high associativity has a much smaller negative impact on accuracy, provided that the tags are long enough. The size impact (for LRU bits) remains, of course.

[Figure 4 plot: MPkI versus size (KiB); curves for base and unified schemes, each 1-way, 2-way, 4-way and 8-way.]

Figure 4: Comparison between split T/NT direction tables and a unified direction cache for varying associativity in the tag cache.
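The difference between the split and the unified index can be illustrated with a small sketch. The function names are hypothetical, the hash is a plain gshare XOR with the most recent outcome in history bit 0, and the choice of which half of the table plays the role of the taken or not-taken cache is an arbitrary labelling made here for illustration.

```python
def split_index(pc, ghist, bits, backing_taken):
    """Original YAGS viewed as one table: the top index bit is the backing prediction."""
    low = (pc ^ ghist) & ((1 << (bits - 1)) - 1)   # gshare hash over bits-1 bits
    select = 1 if backing_taken else 0             # selects one of the two halves
    return (select << (bits - 1)) | low


def unified_index(pc, ghist, bits):
    """Unified cache: the freed index bit carries one more hashed history bit."""
    return (pc ^ ghist) & ((1 << bits) - 1)


# History bit 11 changes the unified index but is invisible to the split index.
print(split_index(0, 1 << 11, 12, False))   # 0
print(unified_index(0, 1 << 11, 12))        # 2048
```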

4.2 Tagging Schemes

Eden and Mudge [EM98] expect that adding bits of the global history to the tag is beneficial to compensate for the small amount of correlation exploited by YAGS. However, they do not present simulation results for this idea. We present results for this, as well as for other tagging schemes.

The tagging schemes make use of the branch address (PC), the global branch history (GBH) and their exclusive or (XOR). To define the tagging schemes, we use i for the number of index bits to compute and t for the number of tag bits. PC[a : b] stands for the bits of the branch address starting from the bit at position a and ending in the bit at position b (Table 2). For example, in the PC scheme, the index is computed as a gshare hash of the least significant PC bits and the most recent GBH bits. Note that the most recent GBH bit is XOR-ed with the least significant PC bit in the hash. The tag again consists of the least significant PC bits. The GBH scheme defines the index in the same way as the PC scheme, but tags the entries with less recent GBH bits. In the PC^GBH scheme, the idea is to construct one big gshare hash of i + t bits and to split it into an index part and a tag part.

Table 2: Definition of indexing and tagging schemes. The variable i denotes the number of index bits to compute and the variable t denotes the number of tag bits.

Name          Function   Definition
PC            Index      PC[0 : i−1] ⊕ GBH[i−1 : 0]
              Tag        PC[0 : t−1]
GBH           Index      PC[0 : i−1] ⊕ GBH[i−1 : 0]
              Tag        GBH[i : i+t−1]
PC^GBH        Index      PC[0 : i−1] ⊕ GBH[t+i−1 : t]
              Tag        PC[i : i+t−1] ⊕ GBH[t−1 : 0]
PC^GBH[0:k]   Index      PC[0 : i−1] ⊕ GBH[k+i−1 : k]
              Tag        PC[i : i+t−1] ⊕ [0 . . . 0, GBH[k−1 : 0]]

The impact of the tags is investigated for base and unified YAGS with associativity ranging from 1 to 4 (Figures 5 and 6). Tagging the direction cache with the GBH is not more accurate than tagging with the branch address. In fact, after a certain tag length, the predictor becomes less accurate. This happens because the cached information becomes too specific, as the long history does not repeat itself frequently enough. The entries are replaced before they can be reused, making the predictor frequently fall back on the backing predictor. As such, the prediction accuracy approaches that of the backing predictor. This effect lessens for higher degrees of associativity, showing that it is caused by aliasing.

[Figure 5 plot: MPkI versus size (KiB); curves for base and unified schemes with tag=PC, tag=GBH and tag=PC^GBH.]

Figure 5: Computation of the tag, based on branch address (PC) and global branch history (GBH) in a direct mapped tag cache.

[Figure 6 plots: MPkI versus size (KiB); curves for base and unified schemes with tag=PC, tag=GBH and tag=PC^GBH. (a) Two-way set-associative caches. (b) Four-way set-associative caches.]

Figure 6: Computation of the tag, based on branch address (PC) and global branch history (GBH) in set-associative tag caches.

Tagging with the XOR of branch address and global branch history is beneficial only for tag lengths between 4 and 8 bits, depending on the degree of associativity. This phenomenon is caused by the same effect that makes the GBH tags less useful for longer tags: as more GBH bits are incorporated in the tag, tag matches become unlikely after some point. Thus, more predictions are taken from the inaccurate bimodal backing predictor, causing the accuracy to drop. To avoid this artifact, we evaluated tags composed of the XOR of the branch address and at most 4 or 6 GBH bits (Figures 7(a) and 7(b)). The accuracy now reaches the optimal accuracy of the PC^GBH scheme. Increasing the tag length does not degrade accuracy, but it also does not increase accuracy.

[Figure 7 plots: MPkI versus size (KiB). (a) PC^GBH hashing: curves for base and unified schemes, each 1-way, 2-way and 4-way. (b) PC^GBH[0:k] hashing in unified YAGS: curves for 1-way PC^GBH, 1-way PC^GBH[0:3], 2-way PC^GBH, 2-way PC^GBH[0:5], 4-way PC^GBH and 4-way PC^GBH[0:5].]

Figure 7: Computation of the tag, based on branch address (PC) and global branch history (GBH) across the degree of associativity.

Note that the unified cache benefits strongly from having one index bit more than the split taken/not-taken caches when PC tags are used. This benefit diminishes with the other tagging schemes, as the tags include the branch history bits with which correlation occurs.

The PC^GBH tagging scheme increases prediction accuracy significantly over the baseline direct mapped YAGS. This improvement is mostly due to the contents of the tag, and only for a small part caused by unifying the taken and not-taken caches. The improvement is 5.25% in a 2-way set-associative unified YAGS with 6-bit tags and reaches 9% in a 4-way set-associative unified YAGS with 8-bit tags. The resulting predictors are, however, larger: by 2 Kibits and 14 Kibits, respectively.

To put the size increase into perspective, we varied the total predictor size, reserving roughly half of the total budget for the bimodal predictor and the remainder for the direction cache. We evaluate the baseline YAGS, a direct mapped unified YAGS with 6-bit PC tags, a 2-way and a 4-way set-associative unified YAGS with 6-bit PC^GBH tags and a 4-way set-associative unified YAGS with 8-bit PC^GBH tags (Figure 8). The direct mapped unified YAGS outperforms the baseline YAGS by 2% to 7.5%, except at 128 Kibits, where performance happens to be low for the twolf trace. The 2-way and 4-way set-associative unified YAGS are slightly more accurate than the direct mapped unified YAGS. However, the 4-way set-associative unified YAGS with 8-bit tags outperforms all schemes for all sizes, improving over the baseline YAGS by 3.16% in the smallest predictor to 10.4% in the largest predictor evaluated.

[Figure 8 plot: MPkI versus size (KiB, logarithmic axis from 0.1 to 1000); curves for base; unified,1-way,PC,tag=6; unified,2-way,PC^GBH,tag=6; unified,4-way,PC^GBH,tag=6; and unified,4-way,PC^GBH,tag=8.]

Figure 8: Computation of the tag for varying predictor size.
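The indexing and tagging schemes of Table 2 can be sketched as bit-field operations. This is an illustration under assumptions rather than the simulator code: the names bits and index_and_tag are hypothetical, GBH bit 0 is taken to be the most recent outcome, and each field is XOR-ed with its lowest bits aligned (following the prose note that the most recent GBH bit meets the least significant PC bit), which ignores any bit reversal the table notation might imply.

```python
def bits(value, lo, width):
    """Return `width` bits of `value`, starting at bit position `lo`."""
    return (value >> lo) & ((1 << width) - 1)


def index_and_tag(scheme, pc, gbh, i, t, k=0):
    """Return (index, tag) for i index bits and t tag bits under a Table 2 scheme."""
    if scheme == "PC":
        return bits(pc, 0, i) ^ bits(gbh, 0, i), bits(pc, 0, t)
    if scheme == "GBH":
        # Same index as PC, but the tag holds the next, less recent, t history bits.
        return bits(pc, 0, i) ^ bits(gbh, 0, i), bits(gbh, i, t)
    if scheme == "PC^GBH":
        # One gshare hash of i + t bits, split into an index part and a tag part.
        return bits(pc, 0, i) ^ bits(gbh, t, i), bits(pc, i, t) ^ bits(gbh, 0, t)
    if scheme == "PC^GBH[0:k]":
        if k > t:
            raise ValueError("k must not exceed the tag length t")
        # Only the k most recent history bits enter the tag, zero-extended to t bits.
        return bits(pc, 0, i) ^ bits(gbh, k, i), bits(pc, i, t) ^ bits(gbh, 0, k)
    raise ValueError("unknown scheme: " + scheme)


pc, gbh = 0b101101101101, 0b011010010011    # arbitrary example values
print(index_and_tag("PC", pc, gbh, 4, 4))           # (14, 13)
print(index_and_tag("GBH", pc, gbh, 4, 4))          # (14, 9)
print(index_and_tag("PC^GBH", pc, gbh, 4, 4))       # (4, 5)
```

With k equal to t, the PC^GBH[0:k] scheme coincides with PC^GBH; smaller k caps the number of history bits that can make a tag overly specific.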

4.3 Analysis over Wide Bit Budget Range

The performance of YAGS depends strongly on the size of the backing predictor. In fact, one has to adjust the fraction of space allocated to the direction prediction caches differently for every bit budget [EM98]. We varied the bit budget of the backing predictor as well as the bit budget of the direction prediction caches for the baseline YAGS configuration. Predictors with the same size of direction prediction caches are grouped in a curve (Figure 9(a)). For each curve, the backing predictor varies over 256, 1K, 4K, 16K to 64K entries. The curves, each for a fixed direction prediction cache size, drop quickly for increasing bimodal predictor size and keep the same level after that. Thus, making the backing predictor large is not beneficial. For example, for a direction prediction cache of 1K entries, counting 8 Kibits, it suffices to have a 1K-entry backing predictor, counting 2 Kibits. Making the backing predictor larger does not significantly improve performance. Thus, it is best to allocate most resources to the direction prediction caches.

Some configurations are clearly inappropriate, as there exist configurations with a lower bit budget and better performance. A configuration is Pareto-optimal when no known configuration achieves better performance at an equal or lower bit budget. The Pareto-optimal configurations for the baseline YAGS are collected from Figure 9(a) and are shown on a continuous curve in Figure 9(b). The Pareto-optimal curves for the unified YAGS configurations, which were studied in the previous section, are constructed in a similar way. These curves show that the unified YAGS improves performance over the baseline YAGS for all bit budgets. The 4-way set-associative unified cache with 8-bit PC^GBH tags performs best over all bit budgets. It is around 10% more accurate than the baseline YAGS: e.g., 10.6% more accurate at 1.7 KiB, 8.7% more accurate at 6.8 KiB and 9.5% more accurate at 32 KiB. In each case, the unified YAGS is smaller than the baseline YAGS.

The 2bcgskew predictor [Sez03] remains more accurate than YAGS, but not by much (Figure 9(b)). At around 8 KiB, 2bcgskew is about 5% more accurate than the 4-way set-associative unified YAGS with 8-bit PC^GBH tags.
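Selecting the Pareto-optimal configurations from a set of simulated (size, MPkI) points can be sketched as follows. The function name pareto_front is hypothetical and the sample points are made up for illustration; they are not results from this report.

```python
def pareto_front(configs):
    """Keep the (size_bits, mpki) points that no smaller-or-equal point beats.

    Points are scanned in order of increasing size; a point is kept only if
    its MPkI is strictly lower than that of every point kept so far.
    """
    front = []
    best_mpki = float("inf")
    for size, mpki in sorted(configs):
        if mpki < best_mpki:
            front.append((size, mpki))
            best_mpki = mpki
    return front


# Made-up example points: (size in Kibits, MPkI).
points = [(10, 5.0), (12, 5.2), (8, 6.0), (20, 4.0), (20, 4.5)]
print(pareto_front(points))   # [(8, 6.0), (10, 5.0), (20, 4.0)]
```

Sorting by size first means a single pass suffices, because any point that could dominate the current one has already been seen.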

[Figure 9 plots: MPkI versus size (KiB, logarithmic axis). (a) Baseline YAGS. Labels indicate the total number of entries in direction prediction caches (64, 256, 1K, 4K, 16K, 64K). (b) Pareto-optimal configurations: curves for base; unified,1-way,PC,tag=6; unified,2-way,PC^GBH,tag=6; unified,4-way,PC^GBH,tag=6; unified,4-way,PC^GBH,tag=8; and 2bcgskew.]

Figure 9: The influence of the bit budget fraction allocated to the backing predictor.

5 Related Work

The cache in YAGS can be viewed as an overriding predictor. This is especially valid for the unified YAGS proposed in this paper. The bimodal predictor is followed by default, except when the cache has better information, in which case the cache overrides the bimodal predictor. Overriding predictors can significantly increase prediction accuracy. Other examples are loop termination predictors [GZ04, SC00], the rare event predictor [HSS99] and McFarling's serial predictor [McF02].

YAGS also has interesting connections with the prediction by partial matching (PPM) predictor [CCM96]. PPM predictors associate tags with state machines that track the branch direction. The PPM predictor of Michaud [Mic04a, Mic04b] can be viewed as a special case of the YAGS predictor where the tag cache is 8-way skewed-associative. The hash function for each bank hashes the program counter with an increasing number of global history bits. This progression of history lengths is crucial to the concept of the PPM predictor [Mic04a]. Furthermore, this implementation of PPM uses additional heuristics to allocate entries in one or more banks on a misprediction and tag miss, and uses more advanced partial update heuristics than the YAGS predictor.

6 Conclusion

We studied the YAGS (Yet Another Global Scheme) conditional branch predictor. We presented the unified YAGS, where the direction prediction caches are unified in a single cache. This has the advantage that one more bit of global history can be used in the indexing of the cache, increasing branch history correlation. The unified YAGS is 3.1% more accurate when tags consist solely of branch address bits. When global branch history is included in the tags, unifying the direction prediction caches has a much smaller effect.

Although previous studies have claimed that increasing associativity of the direction prediction cache is not beneficial to prediction accuracy [EM98], we have shown that associativity is worthwhile, but only in particular circumstances. The best direct mapped configuration uses branch address tags, while the best 2-way and 4-way set-associative configurations use the exclusive-OR of the branch address and global branch history as tag. Furthermore, the 4-way set-associative unified YAGS benefits from increasing the tag size from 6 to 8 bits. The 2-way set-associative unified YAGS is 5.25% more accurate than the baseline direct mapped YAGS. The 4-way set-associative unified YAGS with 8-bit tags is around 10% more accurate than the baseline YAGS for bit budgets in the range of 1 to 100 KiB.

These results show that a significant gain in accuracy can be obtained by changing the tag length and tag content when increasing associativity. The effects on performance of associativity, of the content of the tags (branch address combined in some way with global branch history) and of unified direction prediction caches are strongly interdependent. Therefore, a thorough analysis of the design space of the branch predictor is necessary in order to correctly understand the effect of each of these parameters.

References

[CCM96] I.-C. K. Chen, J. T. Coffey, and T. N. Mudge. Analysis of branch prediction via data compression. In ASPLOS-VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, pages 128–137, October 1996.

[EM98] A. N. Eden and T. Mudge. The YAGS branch prediction scheme. In MICRO 31: Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 69–77, November 1998.

[GZ04] H. Gao and H. Zhou. Adaptive information processing: An effective way to improve perceptron predictors. In 1st Journal of Instruction-Level Parallelism Championship Branch Prediction, 4 pages, December 2004.

[HSS99] T. H. Heil, Z. Smith, and J. E. Smith. Improving branch predictors by correlating on data values. In MICRO 32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 28–37, November 1999.

[LCM97] C.-C. Lee, I.-C. K. Chen, and T. Mudge. The bi-mode branch predictor. In MICRO 30: Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pages 4–13, December 1997.

[McF93] S. McFarling. Combining branch predictors. Technical Report WRL TN-36, Western Research Laboratory, June 1993.

[McF02] S. McFarling. Branch predictor with serially connected predictor stages for improving branch prediction accuracy. US Patent 6374349, April 2002.

[Mic04a] P. Michaud. Analysis of a tag-based branch predictor. Technical Report PI-1660, IRISA/INRIA, November 2004.

[Mic04b] P. Michaud. A PPM-like, tag-based predictor. In 1st Journal of Instruction-Level Parallelism Championship Branch Prediction, 4 pages, December 2004.

[MSU97] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In ISCA '97: Proceedings of the 24th Annual International Symposium on Computer architecture, pages 292–303, June 1997.

[SC00] T. Sherwood and B. Calder. Loop termination prediction. In Proceedings of the 3rd International Symposium on High-Performance Computing (ISHPC2k), pages 73–87, October 2000.

[Sez03] A. Seznec. An optimized 2bcgskew branch predictor, September 2003.