Evaluation of Cache Assisted Multithreaded Architecture

Robert E. Rinker, Roopesh Tamma, and Walid A. Najjar
Department of Computer Science
Colorado State University
Fort Collins, CO 80523
{rinkerr, tamma, [email protected]}

1 Introduction

Solutions to the increasing memory latency problem fall into two categories. Latency avoidance uses caches to keep the most commonly accessed locations within immediate reach of the processor. A multilevel cache system can reduce the miss rate to 1% or less; even so, cache misses are expensive, producing processor idle times that can exceed 50% [15]. Latency tolerance instead tries to hide the latency by switching to another execution thread [4]. A relatively small number of threads is sufficient to hide memory latency; however, the processor must maintain the state of each thread, increasing its complexity.

In this paper we propose and evaluate a novel multithreaded processor organization that relies on cache support to maintain the state of each thread: the Cache-Assisted Multithreaded Processor (CAMP) relies on a modified processor organization and a two-level cache hierarchy (L1 + L2). The objective of the CAMP architecture is to provide an efficient hardware mechanism for switching between ready threads within a processor, with minimal and inexpensive changes to a basic processor organization. The CAMP architecture and execution models are described in Section 2. In Section 3, we describe the experimental set-up for the evaluation of CAMP. Section 4 presents the results of simulations of each of our architectural models. Section 5 discusses related work, and Section 6 presents our conclusions.

2 The Architectural and Execution Models

The basic execution model of the CAMP is an extension of the dribbling registers concept proposed in [12]: it provides hardware support for context switching between threads on a miss in the L2 cache. As shown in Figure 1, the processor contains a duplicated hardware processor state, which includes the register file, the reorder buffer, the register renaming map table, etc. In addition to on-chip instruction and data caches, the CAMP also includes a state cache, which holds copies of the execution states of non-active threads. The operation of the CAMP proceeds as follows: while the CPU is executing using State 0, the file of State 1 is loaded with the state of a ready thread from the state cache. On a miss in the L2 cache, the CPU is switched to State 1; the data in the file of State 0 is saved to the state cache, and the file is reloaded with the state of another ready thread. Since the context switching operation is done off-line (relative to normal CPU activity), it does not slow down the processor as long as the average run length of a context (i.e., the time between consecutive L2 misses) is larger than the context switch time (save time plus restore time) and a ready thread is available. Unlike the Tera MTA [3] or SMT [15] multithreaded execution models, which provide processor support for each thread, the CAMP architecture limits the processor support to two states that are switched among the ready threads.
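As a concrete rendering of this execution model, the sketch below captures the two duplicated state files and the state cache. It is a minimal illustration, not the authors' hardware or simulator: all class and method names are ours, and state transfer latency is not modelled.

```python
from dataclasses import dataclass

@dataclass
class HardwareState:
    """One copy of the duplicated processor state (register files, PC,
    rename map table, etc.), reduced here to an opaque payload."""
    thread_id: int | None = None
    payload: bytes = b""

class CampCpu:
    """Two hardware states switched among ready threads; the state cache
    holds the execution states of all non-active threads. Assignment of
    the initially running thread to State 0 is omitted for brevity."""

    def __init__(self, ready_threads):
        self.states = [HardwareState(), HardwareState()]  # State 0, State 1
        self.active = 0                           # index of executing state
        self.state_cache = dict(ready_threads)    # thread_id -> saved state
        self.preload_spare()

    def preload_spare(self):
        """Off-line: load the idle state file from the state cache."""
        spare = self.states[1 - self.active]
        if spare.thread_id is None and self.state_cache:
            tid, payload = self.state_cache.popitem()
            spare.thread_id, spare.payload = tid, payload

    def on_l2_miss(self):
        """Switch to the pre-loaded spare state; the old state is saved
        back to the state cache and its file reloaded off-line."""
        old = self.states[self.active]
        self.active = 1 - self.active
        if old.thread_id is not None:
            self.state_cache[old.thread_id] = old.payload
        old.thread_id, old.payload = None, b""
        self.preload_spare()
```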

2.1 Processor and Memory Architecture

The model parameters of the CAMP architecture are largely based on those of the Sun UltraSPARC II processor. The cache memory system characteristics are shown in Table 1.

This work is supported in part by DARPA Contract Number DABT63-95-0093.


Figure 1: Architectural model of the CAMP with cache support for four threads.

This base architecture is augmented by adding a separate I-cache, D-cache and state cache for each thread. The 512KB L2 cache is unified and shared by the four threads. The duplicated processor state includes both register files (integer and floating-point), the program counter, the processor state registers, the rename registers, and the register map tables. In this paper we assume that the hardware state for each thread consists of 4K bits. Assuming a bandwidth of 128 bits/cycle between the state caches and the state files, a complete state transfer (state save or restore) takes 32 cycles.

L1 cache (one set per thread):
  I-cache: 16KB each; 32-byte blocks; 2-way set associative.
  D-cache: 16KB each; 32-byte blocks (two 16-byte subblocks per line); direct mapped, write-through, non-write-allocate.
L2 cache (unified): off-chip; 512KB used for this paper, modelled as one 128KB cache per thread; 64-byte blocks; direct mapped, write-back, write-allocate; 3-cycle hit time.

Table 1: Memory characteristics of the CAMP processor as modelled in this paper
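The 32-cycle state transfer time quoted above is simply the assumed state size divided by the state cache bandwidth:

\[
\frac{4096 \text{ bits of thread state}}{128 \text{ bits/cycle}} = 32 \text{ cycles per save (or restore)}.
\]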

2.2 Context Switch Schemes

In this paper we compare three different context switch schemes:

- Save All/Restore All (SA/RA): the entire state file is saved on a context switch, and the entire state of a new thread is loaded.

- Save Dirty/Restore All (SD/RA): each architecture register is tagged with a dirty bit, which is set when the register is written. The dirty bits are reset whenever a new state is loaded into the file. Only those registers that have been modified since the last cache miss are saved to the state cache; the entire state of a new thread is loaded.

- Save Dirty/Restore On Demand (SD/ROD): only modified registers are saved. Only those components of the state that are essential to execution are loaded into the state file: the PC, the processor state registers, and the rename map table. Architecture and rename registers are loaded dynamically when accessed. Each register has a valid bit that is reset at every context switch and set whenever the register is written, either by an instruction or via a dynamic reload.

Figure 2: The three context switch schemes: (a) SA/RA, (b) SD/RA, and (c) SD/ROD.

These three schemes are shown pictorially in Figure 2. The SA/RA scheme is the simplest to implement but has the largest overhead, since all registers are saved and restored. The inefficiency of this scheme stems from the fact that the number of registers actually used between cache misses is a small fraction of the total. The SD/RA scheme reduces the state saving time, since only a fraction of all the registers are redefined between L2 cache misses. The SD/ROD scheme eliminates most of the time spent during reloading, but requires that the CPU wait for a register to be reloaded when it is accessed for the first time after a cache miss. This moves the register restore time from an off-line operation, incurred during register reloading, to an on-line one, incurred during thread execution; however, only those registers that are accessed need be loaded, thus reducing the total reload time.
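To make these trade-offs concrete, the following is a small cost sketch (not the authors' simulator) of the per-switch traffic under each scheme. The 64-register state, 64-bit registers, 128-bit/cycle bandwidth, and 1-cycle on-demand penalty follow the parameters used in this paper; the function names and example inputs are illustrative.

```python
REGS = 64         # registers per hardware thread state
REG_BITS = 64     # bits per register
BW = 128          # processor to state cache bandwidth, bits/cycle

def cycles(n_regs: int) -> int:
    """Cycles to move n_regs registers at BW bits/cycle."""
    return (n_regs * REG_BITS + BW - 1) // BW

def sa_ra() -> tuple[int, int]:
    """Save All/Restore All: the whole file moves both ways (off-line)."""
    return cycles(REGS), cycles(REGS)

def sd_ra(dirty: int) -> tuple[int, int]:
    """Save Dirty/Restore All: only dirty registers are saved."""
    return cycles(dirty), cycles(REGS)

def sd_rod(dirty: int, accessed: int) -> tuple[int, int]:
    """Save Dirty/Restore On Demand: dirty registers saved off-line;
    accessed registers reloaded on-line at 1 cycle each."""
    return cycles(dirty), accessed * 1

# Average SPECint95 behaviour from Section 3 (13 defined, ~10 accessed):
print(sa_ra())         # (32, 32) cycles save/restore
print(sd_ra(13))       # (7, 32)
print(sd_rod(13, 10))  # (7, 10); the 10 restore cycles occur on-line
```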

3 Experimental Set-Up

Three distributions affect the performance of the context switch schemes:

- The lengths of the run segments: the number of instructions between consecutive L2 misses.
- The number of registers defined within a run segment.
- The number of registers accessed before being redefined within a run segment.

These distributions were measured using shade [14] (a SPARC instruction tracing/simulation tool) on the SPEC95 benchmarks [13].

The benchmarks were executed sequentially (non-parallel); thus, the term thread as used in this paper denotes a single process. However, the more important quantum of measurement in this study is the run segment length, which would not change if each benchmark were parallelized and scheduled as multiple threads; the results are therefore valid for multithreaded processes as well.

Figure 3 shows the measured L2 cache miss behavior as a distribution of run segment lengths. For the SPECint95 benchmarks, approximately 20% of the run segments are less than 30 instructions long, and 25% are greater than 300. The SPECfp95 results exhibit a slightly different profile, with 40% and 10% of the run segments falling in those two ranges, respectively. This difference is also reflected in the average run segment lengths: 294 for SPECint95 versus 149 for SPECfp95. While the median run segment length between L2 misses is typically around 100-400 cycles (corresponding to a miss rate of 0.25-1.0%), the distribution includes very short run segments (e.g., compulsory misses at the start of program execution, or a program with poor spatial locality) and very long ones (e.g., programs with tight loops). Such variations in run length dramatically affect the results: short run segments do not provide enough execution time to fully hide the context switch time of the off-line thread, while long run segments do not benefit from multithreaded execution.

Figure 4 shows the register usage distributions. The number of registers actually defined and accessed in a given run segment is a relatively small fraction of the total, with an average of 13 registers defined and 9.5 registers accessed (out of 64) for SPECint95, and around 15 and 16.5, respectively, for SPECfp95. This indicates that the Save Dirty/Restore On Demand model should perform well, and that Save Dirty/Restore All should do better than Save All/Restore All.
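For illustration, the three distributions can be computed from a per-instruction event stream such as the one a tracing tool like shade produces. The sketch below is a generic stand-in; the real tool's interface is not reproduced here, and all names are hypothetical.

```python
from collections import Counter

def segment_stats(trace):
    """trace: iterable of (regs_read, regs_written, l2_miss) tuples,
    one per retired instruction. Yields, for each run segment delimited
    by L2 misses: (segment length, registers defined, registers accessed
    before being redefined). The trailing partial segment is dropped."""
    length, defined, live_reads = 0, set(), set()
    for regs_read, regs_written, l2_miss in trace:
        length += 1
        # A register counts as "accessed" only if it is read before
        # being redefined within the current segment.
        live_reads.update(r for r in regs_read if r not in defined)
        defined.update(regs_written)
        if l2_miss:
            yield length, len(defined), len(live_reads)
            length, defined, live_reads = 0, set(), set()

length_hist = Counter()
for seg_len, n_def, n_acc in segment_stats(trace=[]):  # supply a real trace
    length_hist[min(seg_len // 100, 10)] += 1          # 100-instruction bins
```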

Figure 3: Thread run length distributions for SPECint95 and SPECfp95. [Two histograms; x-axis: run segment length in instructions, in bins of 100 up to 1000 plus an overflow bin; y-axis: percent of total run segments.]

Figure 4: Distributions of the number of registers defined and the number of registers referenced (before being redefined) within a run segment. [Two histograms, SPECint95 and SPECfp95; x-axis: number of registers used (0-39); y-axis: percent of run segments; series: Defined, Accessed.]

Parameter                                    Default Value
Context switch time                          1 cycle
Instructions per cycle (IPC)                 2
Processor to state cache bandwidth           128 bits/cycle
L2 miss penalty                              50 cycles
On-demand register load penalty (for SD/ROD) 1 cycle

Table 2: Parameters and default values used in the discrete-event simulation models

4 Performance Evaluation

Discrete-event simulation models were developed for each of the three swapping schemes, incorporating the run segment length and register count distributions. The default architectural parameters are summarized in Table 2. The context switch time is the overhead of switching between state files. The processor to state cache bandwidth is the rate at which the state files can be saved and restored. The on-demand register load penalty is the additional access time incurred in the SD/ROD scheme when a register is dynamically loaded. The performance of the schemes is compared against two other models:

- A non-threaded model (BASE), which simulates the execution of a conventional superscalar processor with the same IPC as the CAMP models.

- A multithreaded architecture model (MT) with hardware support for each thread and an execution model that switches every cycle among all the ready threads. This model is similar in concept to the Tera MTA (albeit with fewer contexts). We assume no context switching overhead for this model; since it has the lowest overhead, its performance serves as a high-water mark for the CAMP models.
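To make the mechanics of such a model concrete, here is a deliberately small sketch of a CAMP-style event loop. It is not the simulator used in this study; the segment lengths, the SA/RA swap cost, and the assumption that a ready thread is always available are all illustrative.

```python
import random

MISS_PENALTY = 50       # cycles (Table 2)
SWITCH_TIME = 1         # cycles to flip between state files (Table 2)
SWAP_CYCLES = 32 + 32   # SA/RA save + restore at 128 bits/cycle

def simulate(seg_lengths, multithreaded, ipc=2):
    """Toy event loop: execution alternates with L2 misses. With
    multithreading, the miss latency is overlapped by the other thread,
    and the CPU stalls only for the switch plus any off-line state swap
    that did not finish during the preceding run segment."""
    busy = idle = switch = 0.0
    swap_left = 0.0                          # outstanding off-line swap work
    for seg in seg_lengths:
        run = seg / ipc                      # cycles to execute the segment
        busy += run
        swap_left = max(0.0, swap_left - run)    # swap hidden under execution
        if multithreaded:
            switch += SWITCH_TIME + swap_left    # unfinished swap stalls CPU
            swap_left = SWAP_CYCLES              # start swapping the old state
        else:
            idle += MISS_PENALTY                 # single thread just waits
    return busy, idle, switch

segs = [random.choice([30, 150, 400]) for _ in range(10_000)]  # toy lengths
print(simulate(segs, multithreaded=False))
print(simulate(segs, multithreaded=True))
```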

The first performance measure is the speedup, defined by

\[ \text{Speedup} = \frac{\text{Base Time}}{\text{MT Time}} \]

where MT Time is the multithreaded execution time, and

\[ \text{Base Time} = \sum_{\text{all threads}} \text{Thread CPU Time} + \sum_{\text{all threads}} \text{Thread Memory Time} \]

\[ \text{MT Time} = \sum_{\text{all threads}} \text{Thread CPU Time} + \text{Total CPU Idle Time} + \text{Total Context Switch Time} \]

Figure 5 shows the speedup versus the number of threads for each of the models (including MT). The results for SPECint95 and SPECfp95 differ: since the average run segment length of the SPECint95 programs is nearly double that of SPECfp95, there is less idle time due to cache misses, and thus considerably less opportunity for speedup.

Figure 6 shows the processor idle time for the four models. The idle time for a single thread is also shown; this value represents the overhead incurred by the system for cache misses, and also defines the maximum speedup that can be achieved by multithreading. The idle times for SPECfp95 are all nearly double those for SPECint95, also consistent with the difference in run segment lengths. For SA/RA, processor idle time does not fall below about 25% for SPECfp95 and 10% for SPECint95; this floor represents the overhead, relative to thread execution time, incurred by context switching. The high cost of a context switch in this scheme cannot be fully hidden by the L2 miss latency, and the shorter run segment lengths of SPECfp95 hide even less of it. SD/RA, which reduces context switch time by saving only dirty registers, does somewhat better, with context switch overheads of 16% and 6%, respectively. The SD/ROD scheme achieves the best speedup of the CAMP schemes; the difference between it and MT is relatively constant. The context switch overhead for SD/ROD consists of two parts: (1) the saving of modified registers, which occurs off-line relative to processor execution and can be hidden by the execution of the other thread, and (2) the restore-on-demand time, which always contributes to context switching overhead because it occurs during execution. The time required to save dirty registers is small relative to the average run segment length, so this component contributes little to context switch overhead. The registers that must be restored, however, always add to the overhead, creating the relatively constant performance difference between the two schemes.
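These definitions translate directly into code; a minimal sketch, with purely hypothetical input values:

```python
def speedup(cpu_times, mem_times, idle_total, switch_total):
    """Speedup = Base Time / MT Time, per the definitions above.
    cpu_times and mem_times are per-thread totals in cycles."""
    base_time = sum(cpu_times) + sum(mem_times)
    mt_time = sum(cpu_times) + idle_total + switch_total
    return base_time / mt_time

# Hypothetical four-thread example (all values in cycles):
print(speedup([1e6] * 4, [4e5] * 4, idle_total=2e5, switch_total=1e5))
# -> 5.6e6 / 4.3e6, roughly 1.30
```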

Figure 5: Speedup versus number of threads for the three CAMP models, compared with MT. [Two panels: SPECint95 (speedup up to about 1.35) and SPECfp95 (up to about 1.7); x-axis: 1-8 threads; series: MT, SA/RA, SD/RA, SD/ROD.]

Figure 6: CPU idle time versus number of threads for the three CAMP models, compared with MT. [Two panels: SPECint95 (% CPU idle time, 0-20) and SPECfp95 (0-40); x-axis: 1-7 threads; series: MT, SA/RA, SD/RA, SD/ROD.]

Figure 7: Speedup versus IPC for a superscalar processor, for the three CAMP models. [Two panels: SPECint95 and SPECfp95; x-axis: 1-4 instructions per cycle; series: SA/RA, SD/RA, SD/ROD.]

Figure 8: Speedup versus state swapping bandwidth, for four threads, for the three CAMP models. [Two panels: SPECint95 and SPECfp95; x-axis: bandwidth of 64, 128, 256, and 512 bits/cycle; series: SA/RA, SD/RA, SD/ROD.]

No appreciable gain in speedup occurs for any of the models beyond about four or five threads. Similar results appear in other research: a relatively small number of threads can hide virtually all cache miss latency [1, 6, 15, 16].

Figure 7 shows the effect of varying the instructions per cycle (IPC) on the results. An effective IPC of 2 is typical for this class of superscalar processor. Increasing the IPC increases both the effective main memory access time and the state save/restore times relative to the number of instructions being executed. The first change improves the multithreading results, since there is more miss latency to hide, while the second reduces performance, because the state swapping time grows relative to the number of instructions executed. The figure shows that as the effective IPC increases, the CAMP architecture becomes somewhat more effective.

Figure 8 shows the speedup results for processor-to-state-cache bandwidths of 64, 256, and 512 bits/cycle, in addition to the base bandwidth of 128 bits. A bandwidth of 256 bits appears sufficient to extract the maximum performance from CAMP. As the bandwidth is increased, and the cost of saving and restoring registers is therefore reduced, SD/RA overtakes SD/ROD in performance; this is because the Restore-On-Demand scheme loads one register at a time and therefore does not benefit from the increased bandwidth.

5 Related Work

The Denelcor HEP [11], the first commercial processor to use multithreading, and the later Tera MTA [3, 2] use fine-grain multithreading, whereby the processor maintains a large number (128) of thread states. By being able to switch threads on every cycle, the processor can hide all memory (and other) latencies without the use of caches. By contrast, the CAMP architecture we propose maintains only two states, with more states held in a state cache; it is a simpler and more conventional processor, but it incurs overhead when swapping states that is not entirely hidden by the execution of the second thread. Nonetheless, in terms of hiding memory latencies, we show that CAMP can achieve performance within a few percentage points of a Tera-type scheme.

Hum and Gao [8, 10] proposed the idea of saving and restoring only the required registers, as determined by compile-time analysis. They used this idea to speed up the context switching used to hide communication latencies in the EARTH-MANNA processor. Eichemeyer et al. [6] report on the use of multithreading to hide the longer latencies associated with message synchronization. They assume a thread switch time of 10 cycles, without describing a mechanism for achieving such a fast switch. Soundararajan and Agarwal [12] use an idea similar to that described in this paper, which they call dribbling registers, to hide communication latencies. They use a round-robin mechanism for selecting a ready-to-execute thread, and only initiate a register save/restore operation after determining that no currently loaded thread is available for execution. Their results show that the technique is effective in hiding these longer-latency events. The results reported here appear to be in line with theirs; we extend their approach to hide cache miss latencies, and also explore the effectiveness of several different register swapping policies.

Tullsen, Eggers, and Levy [15, 9, 5] report on a technique they call simultaneous multithreading (SMT); earlier research on a similar scheme was done by Yamamoto and Nemirovsky [16], and further extensions that incorporate vector processing into the architecture are proposed by Espasa and Valero [7]. The functional units normally associated with a multiprocessor (several independent processors) are reorganized into a single processor that can simultaneously issue instructions to each of the functional units. Tullsen et al. categorize a processor's inability to utilize its resources as either horizontal waste, where some of the issue slots in a cycle go unused, or vertical waste, where entire cycles go unused due to a lack of sufficient parallelism. Our proposed architecture is a less ambitious but simpler solution, involving only a single processor: it hides only memory latencies, found to be the largest component of vertical waste. The results reported in this paper, showing speedups from 1.18 to 1.51, seem in line with their result of 1.68, considering the difference in scope between the two studies.

6 Conclusion

This paper presents the results of studies of a proposed modification to a conventional sequential processor that we call the Cache Assisted Multithreaded Processor (CAMP). This modification adds a state cache and context swapping hardware to a conventional single-threaded processor. CAMP works in conjunction with the regular cache memory system to help the processor tolerate main memory access latencies. We propose three different state swapping schemes, SA/RA (Save All/Restore All), SD/RA (Save Dirty/Restore All), and SD/ROD (Save Dirty/Restore On Demand), and compare their performance with that of a conventional processor and of a multithreaded processor that needs no state cache or swapping hardware. We found that CAMP produces a speedup of 1.18 (for SA/RA) to 1.51 (for SD/ROD) over a conventional processor.

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K.L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife machine: Architecture and performance. In Int. Symp. on Computer Architecture, 1995.

[2] G. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith. Exploiting heterogeneous parallelism on a multithreaded multiprocessor. Tera Computer Company.

[3] R. Alverson, D. Callahan, D. Cummings, and B. Koblenz. The Tera computer system. In Int. Conference on Supercomputing, June 1990.

[4] G.T. Byrd and M.A. Holliday. Multithreaded processor architectures. IEEE Spectrum, August 1995.

[5] S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, and D.M. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, pages 12-19, Sep/Oct 1997.

[6] R.L. Eichemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Int. Symp. on Computer Architecture, pages 203-212, May 1996.

[7] R. Espasa and M. Valero. Exploiting instruction- and data-level parallelism. IEEE Micro, pages 20-27, Sep/Oct 1997.

[8] H.H.J. Hum and G.R. Gao. A novel high-speed memory organization for fine-grain multithread computing. In PARLE '91: Parallel Architectures and Languages Europe, volume I, LNCS 505, pages 34-51, Eindhoven, The Netherlands, June 1991. Springer-Verlag.

[9] J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322-354, August 1997.

[10] O. Maquelin, H.J. Hum, and G.R. Gao. Costs and benefits of multithreading with off-the-shelf RISC processors. In EURO-PAR '95, pages 117-128, Stockholm, Sweden. Springer-Verlag, August 1995.

[11] B.J. Smith. Architecture and applications of the HEP multiprocessor computer system. In SPIE, volume 298, Real Time Signal Processing, pages 241-248, 1981.

[12] V. Soundararajan and A. Agarwal. Dribbling registers: A mechanism for reducing context switch latency in large-scale multiprocessors. Technical Memo TM-474, MIT, 1992.

[13] The SPEC Corporation. The SPEC95 Benchmark Suite, 1995.

[14] Sun Microsystems, Mountain View, CA. Introduction to SHADE, 1992.

[15] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Int. Symp. on Computer Architecture, pages 392-403, June 1995.

[16] W. Yamamoto and M. Nemirovsky. Increasing superscalar performance through multistreaming. In IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT '95), pages 49-58, 1995.