Near-Optimal Padding for Removing Conflict Misses

Xavier Vera¹, Josep Llosa², and Antonio González²

¹ Institutionen för Datateknik, Mälardalens Högskola, P.O. Box 883, Västerås, 721 23, Sweden, [email protected]

² Computer Architecture Department, Universitat Politècnica de Catalunya-Barcelona, Jordi Girona 1-3, Barcelona, 08034, Spain, {josepll,antonio}@ac.upc.es

Abstract. The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as padding, a transformation targeted at reducing conflict misses. This paper presents a novel approach to perform near-optimal padding for multi-level caches. It analyzes programs, detecting conflict misses by means of the Cache Miss Equations. A genetic algorithm is used to compute the parameter values that enhance the program. Our results show that this approach can remove practically all conflicts among variables in the SPECfp95 benchmarks, targeting all the different cache levels simultaneously.

1 Introduction

Memory performance is critical for the performance of current computers. Memory is organized hierarchically so that the upper levels are smaller and faster. The uppermost level typically has a very short latency (e.g. 1-2 cycles), but the latency of the lower levels may be a few orders of magnitude longer (e.g. main memory latency may be around 100 cycles). Thus, techniques that keep as much data as possible in the uppermost levels are key to performance.

In addition to the hardware organization, it is well known that the performance of the memory hierarchy is very sensitive to the particular memory reference patterns of each program. The reference patterns of a given program can be changed by means of transformations that do not alter the semantics of the program. These program transformations can modify the order in which some computations are performed or can simply change the data layout. Padding is an example of the latter family of techniques. Padding adds dummy elements between variables (inter-variable padding) or between elements of the same variable (intra-variable padding).

Padding has a significant potential to remove cache misses. In fact, it can remove most conflict misses by changing the addresses of conflicting data, and some compulsory misses by aligning data with cache lines. However, finding the optimal padding for a given program is a very complex task, since the options are almost unlimited and exploring all of them is infeasible. For very simple programs, the programmer's intuition may help, but in general a systematic approach that can be integrated into a compiler and can deal with any type of program and cache architecture is desirable. This systematic approach requires the support of a locality analysis method in order to assess the performance of different alternatives.

Fig. 1. Data layout: (a) before inter-variable padding, (b) after inter-variable padding, (c) before padding, (d) after padding, (e) 2-D array, (f) 2-D array after intra-variable padding. [Figure not reproduced. Its panels show the variables Var0, Var1 and Var2, the rows of Var1 (Row0 to Rown), the inter-variable paddings P_Base0, P_Base1 and P_Base2, the dimensions Dim10 and Dim11, and the intra-variable padding P_Dim10.]

In this paper, we propose an automatic approach to perform both inter- and intra-variable padding in numeric codes, targeting any kind of multi-level cache. It is based on a very accurate technique for analyzing the locality of a program, known as Cache Miss Equations (CMEs) [6], and on a genetic algorithm that searches the solution space. Earlier, we proposed techniques to estimate the locality of a possible solution in a few seconds [2, 21], in spite of the fact that a direct solution to the CMEs is an NP problem. The proposed genetic algorithm converges very fast and, although it does not guarantee that the optimal solution is found, we show that after padding the conflict miss ratio of the evaluated benchmarks is almost negligible. Besides, when our method is compared with previous frameworks that address padding [17, 19], our approach yields better results in 91% of the cases.

The rest of this paper is organized as follows. Section 2 presents the padding technique, and its performance is evaluated in Section 3. Section 4 outlines related work and compares our method with previous approaches. Finally, Section 5 summarizes the main conclusions of this work.
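As a concrete illustration of the conflict misses that padding targets, the following sketch simulates a small direct-mapped cache and counts the misses of a loop that alternates between two arrays. It is only a toy model under stated assumptions, not the CME-based analysis used in this paper: the function name `count_misses`, the 1024-byte cache, the 32-byte lines and the 8-byte elements are all hypothetical choices made for the example.

```python
def count_misses(base_a, base_b, n, elem_size=8, cache_size=1024, line_size=32):
    """Count misses of a direct-mapped cache for the stream a[0], b[0], a[1], b[1], ..."""
    num_lines = cache_size // line_size
    tags = [None] * num_lines  # memory line currently held by each cache line
    misses = 0
    for i in range(n):
        for base in (base_a, base_b):
            mem_line = (base + i * elem_size) // line_size
            slot = mem_line % num_lines  # direct-mapped placement
            if tags[slot] != mem_line:
                misses += 1
                tags[slot] = mem_line
    return misses


n = 128  # each array occupies 128 * 8 = 1024 bytes, exactly the assumed cache size

# b starts right after a, so a[i] and b[i] always map to the same cache line.
print(count_misses(0, 1024, n))       # 256: every access misses

# One cache line (32 bytes) of inter-variable padding before b.
print(count_misses(0, 1024 + 32, n))  # 64: only the first touch of each line misses
```

In this toy setting the padded layout leaves only the 64 misses caused by touching each of the 64 memory lines for the first time, which is the kind of effect a good padding aims for.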

2 Padding

This section presents our method for guiding both inter- and intra-variable padding. In this paper we refer to the size of the L1 (primary) cache as $C_s$. $mem_i$ is the original base address of variable number $i$ ($Var_i$), and $P\_Base_i$ stands for the inter-variable padding between $Var_i$ and $Var_{i-1}$. $dim_{ij}$ stands for the size of dimension $j$ of $Var_i$ ($D_i$ is its number of dimensions) and $S_i$ is its size. $P\_Dim_{ij}$ is the intra-variable padding applied to $dim_{ij}$, and $PS_i$ is the size of $Var_i$ after padding (see Figure 1). We define $\Delta_i$ as $PS_i - S_i$.

2.1 Inter-variable padding

When inter-variable padding is applied, only the base addresses of the variables are changed. Thus, padding is performed in a simple way. Memory variable base addresses are initially defined using the values given by the compiler. Then, for each memory variable $Var_i$, $i = 0 \ldots k$, we define a variable $P\_Base_i$ such that

$$0 \le P\_Base_i \le C_s - 1$$

Note that padding a variable is equivalent to modifying the initial addresses of the other variables (see Figure 1). Thus, after padding, the memory variable base addresses are computed as follows:

$$BaseAddr(Var_i) = mem_i + \sum_{k=0}^{i} P\_Base_k$$
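The base-address formula above amounts to a running sum of the padding values. The sketch below shows one way to evaluate it. The function name `padded_base_addresses`, the sample addresses and the choice of bytes as the unit are assumptions made for illustration only.

```python
from itertools import accumulate


def padded_base_addresses(mem, p_base, cache_size):
    """Return BaseAddr(Var_i) = mem_i + sum of P_Base_k for k = 0..i, for every i."""
    if len(mem) != len(p_base):
        raise ValueError("one padding value is needed per variable")
    if any(not 0 <= p <= cache_size - 1 for p in p_base):
        raise ValueError("each P_Base_i must lie in [0, Cs - 1]")
    # Prefix sum of the paddings: entry i holds P_Base_0 + ... + P_Base_i.
    shifts = accumulate(p_base)
    return [m + s for m, s in zip(mem, shifts)]


# Hypothetical layout: three 4096-byte variables placed back to back by the
# compiler, and an assumed cache size Cs of 4096 bytes.
mem = [0, 4096, 8192]
p_base = [0, 32, 64]
addrs = padded_base_addresses(mem, p_base, cache_size=4096)
print(addrs)                      # [0, 4128, 8288]
print([a % 4096 for a in addrs])  # [0, 32, 96]
```

The second line of output shows the offsets of the padded base addresses modulo the assumed cache size: the three variables no longer start at the same offset, and the padding of an earlier variable shifts every later one.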

2.2 Adding intra-variable padding

The result of applying both inter- and intra-variable padding is that the base address and the size of every dimension of each memory variable may change. They are initially set according to the values given by the compiler. For each memory variable $Var_i$, $i = 0 \ldots k$, we define a set of variables $\{P\_Base_i, P\_Dim_{ij}\}$, $j = 0 \ldots D_i$, such that

$$0 \le P\_Base_i,\; P\_Dim_{ij} \le C_s - 1$$

After padding, memory variable base addresses are computed in the following way (see Figure 1):

+

BaseAddr(V ari ) = memi + Pk