A Decoupled Scheduled Dataflow Multithreaded Architecture†

Krishna M. Kavi, Hyong-Shik Kim, Joseph Arul
Dept of Electrical and Computer Engineering
University of Alabama in Huntsville
Huntsville, AL 35899
{kavi, hskim, [email protected]

Ali R. Hurson
Dept of Computer Science and Engineering
Pennsylvania State University
University Park, PA 16802
[email protected]

Abstract

In this paper we propose a new approach to building multithreaded uniprocessors that become building blocks in high-end computing architectures. Our innovation stems from a multithreaded architecture with non-blocking threads in which all memory accesses are decoupled from thread execution. Data is pre-loaded into the thread context (registers), and all results are post-stored after the thread completes execution. Decoupling memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. This separation facilitates high locality and minimizes the impact of distribution and hierarchy in large memory systems. The non-blocking nature of threads eliminates the need for thread switching, thus reducing the overhead of scheduling threads. The functional execution paradigm eliminates the complex hardware required for instruction scheduling in modern superscalar architectures. We present preliminary results obtained from Monte Carlo simulations of the proposed architectural features.

1 Introduction

Multithreading has been touted as the solution to the ever-increasing performance gap between processors and memory systems (i.e., the memory wall). The latency incurred by an access to memory can be tolerated by performing useful work on a different thread. While there is no single best approach to multithreading applications, there is a consensus that on conventional superscalar-based architectures, using conventional single-threaded or coarse-grained multithreading models, performance peaks for a small number of threads (2–4).

† This work is supported in part by the following NSF grants: MIPS 9796310, EIA 9729889 and EIA 9895216.


This implies that adding more pipelines, functional units, or hardware contexts is not cost-effective, since the instruction issue width is limited by the available parallelism (viz., 2–4). It is our contention that the advantages of superscalar and wide-issue architectures can only be realized by finding an appropriate multithreaded model and implementation to achieve the best possible performance. We believe that the use of non-blocking, fine-grained threads will improve the performance of multithreaded systems for a much larger number of threads. In our ongoing research, we have been investigating architectural innovations for improving the performance of multithreaded dataflow systems [9]. This paper describes a new architecture that utilizes decoupled memory/execution units and schedules dataflow instructions somewhat like control-flow execution models. We present our preliminary results.

2 Decoupling Memory Accesses From Execution Pipeline

Jim Smith [15] presented an architecture that separated memory accesses from the execution of instructions. It is interesting to observe that this approach was advanced in an attempt to overcome the ever-increasing processor-memory communication cost. Since then, the memory latency problem has been alleviated by using cache memories. However, increasing cache capacities, while consuming an increasingly large silicon area on processor chips, have yielded only diminishing returns. The gap between processor speed and average memory access speed is once again the major limitation in achieving high performance. Decoupled architectures may yet present a solution for leaping over the "memory wall." In addition, combining the decoupled architecture with multithreading allows for a wide range of implementations for next-generation architectures. In this section we describe two multithreaded architectures that support decoupled memory accesses (Rhamma and PL/PS). In the next section we show how the decoupled access/execute units can be utilized in our new dataflow architecture.
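To make the decoupling idea concrete, the following sketch splits a dot-product loop into an access stream and an execute stream that communicate through a queue, in the spirit of Smith's decoupled access/execute organization [15]. The queue, the function names, and the stalling behavior it stands in for are illustrative; they are not taken from Rhamma, PL/PS, or the architecture proposed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define QSIZE 1024
typedef struct { double buf[QSIZE]; size_t head, tail; } queue_t;

/* Toy queue; a real decoupled machine would stall a stream when the queue
 * is full or empty, which is what lets the access stream run ahead. */
static void enqueue(queue_t *q, double v) { assert(q->tail - q->head < QSIZE); q->buf[q->tail++ % QSIZE] = v; }
static double dequeue(queue_t *q)         { assert(q->head < q->tail); return q->buf[q->head++ % QSIZE]; }

/* Access "processor": issues only loads, pushing operands into the queue. */
static void access_stream(queue_t *q, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) { enqueue(q, a[i]); enqueue(q, b[i]); }
}

/* Execute "processor": issues only arithmetic, consuming queued operands. */
static double execute_stream(queue_t *q, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += dequeue(q) * dequeue(q);
    return sum;
}

int main(void) {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    queue_t q = {0};
    access_stream(&q, a, b, 4);
    printf("dot product = %g\n", execute_stream(&q, 4));   /* 70 */
    return 0;
}
```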

Figure 1. Rhamma Processor: a Memory Processor and an Execute Processor, each with IF, ID, OF, and EX/WB pipeline stages, sharing the instruction cache, data cache, scoreboard, and register contexts.

2.1 Rhamma Processor

A multithreaded architecture (called Rhamma) that implements decoupled memory access/execution was designed in Germany [6]. Figure 1 shows the overall structure of the Rhamma processor. Rhamma uses two separate processors: a Memory Processor that performs all load and store instructions, and an Execute Processor that executes all other instructions. A single sequence of instructions (thread) is generated for both processors: when a memory access instruction is decoded by the Execute Processor, a context switch returns the thread to the Memory Processor, and when the Memory Processor decodes a non-memory instruction, a context switch hands the thread over to the Execute Processor. Threads are blocking, and additional context switches due to data dependencies may be incurred during the execution of a thread.

2.2 PL/PS Architecture

Another multithreaded architecture that decouples memory accesses from execution can be found in [10]. In this architecture, threads are non-blocking, and all memory accesses are performed by the Memory Processor, which delivers enabled threads to the Execute Processor. A thread is enabled when its required inputs are available, and all of its operands are pre-loaded into a register context. Once enabled, a thread executes to completion on the Execute Processor without blocking. The results of completed threads are post-stored by the Memory Processor.
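The sketch below illustrates the non-blocking pre-load/execute/post-store thread lifecycle described above. The data structures and function names are our own illustrative assumptions, not the PL/PS hardware interface; the point is only that a thread touches memory exclusively before it starts and after it finishes.

```c
#include <stdio.h>

#define CTX_REGS 16

/* A thread is enabled only when all of its inputs have arrived. */
typedef struct {
    int  sync_count;             /* remaining inputs before the thread is enabled */
    double regs[CTX_REGS];       /* register context filled by the pre-load phase */
    void (*body)(double *regs);  /* non-blocking thread body: registers only      */
} thread_t;

/* Memory "processor": pre-load operands, hand off, then post-store results. */
static void pre_load(thread_t *t, const double *frame, int n) {
    for (int i = 0; i < n; i++) t->regs[i] = frame[i];    /* all loads up front   */
}
static void post_store(const thread_t *t, double *frame, int n) {
    for (int i = 0; i < n; i++) frame[i] = t->regs[i];    /* all stores at the end */
}

/* Execute "processor": runs an enabled thread to completion; it never touches
 * memory, so it never stalls on a cache miss. */
static void run_enabled(thread_t *t) { t->body(t->regs); }

static void saxpy_body(double *r) { r[2] = r[0] * r[1] + r[2]; }   /* example thread */

int main(void) {
    double frame[3] = {2.0, 10.0, 5.0};       /* operands produced by other threads */
    thread_t t = { .sync_count = 0, .body = saxpy_body };
    if (t.sync_count == 0) {                   /* thread is enabled */
        pre_load(&t, frame, 3);
        run_enabled(&t);
        post_store(&t, frame, 3);
    }
    printf("result = %g\n", frame[2]);         /* 25 */
    return 0;
}
```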

3 Scheduled Dataflow

Even though the dataflow model and dataflow architectures have been studied for more than two decades, and held the promise of an elegant execution paradigm with the ability to exploit the inherent parallelism in applications, actual implementations of the model have failed to deliver the promised performance. Most modern processors have brought the execution engine closer to an idealized dataflow engine to achieve high performance, albeit with complex hardware (e.g., instruction scheduling, register renaming, out-of-order instruction issue and retirement, non-blocking caches, branch prediction and predicated branches). It is our contention that such complexity can be eliminated if a more suitable implementation of the dataflow model can be found. We feel that the primary limitations of the pure dataflow model that prevented commercially viable implementations are: (a) too fine-grained (instruction-level) multithreading; (b) difficulty in using memory hierarchies and registers; (c) asynchronous triggering of instructions.

Many researchers have addressed the first two limitations of dataflow architectures [9, 16, 17, 18]. Several research projects have demonstrated how coarser-grained threads can be utilized within the dataflow execution model, and the benefits of cache memories within the context of the Explicit Token Store (ETS) dataflow paradigm were presented in [9]. In this section we propose a new dataflow architecture that addresses the third limitation by deviating from the asynchronous triggering of dataflow instructions and scheduling instructions for synchronous execution. The new model also decouples all memory accesses from thread execution (like PL/PS) to alleviate memory latencies and further exploit multithreading.

There have been several hybrid architectures in which dataflow scheduling is applied only at the thread level (i.e., macro-dataflow), with threads comprising conventional control-flow instructions (e.g., [5], [7], [14]). In such systems, the instructions within a thread do not retain functional properties, and they introduce side-effects as well as WAW and WAR dependencies. Lacking dataflow properties at the instruction level, they require complex hardware for detecting data dependencies and dynamically scheduling instructions. In our system, the instructions within a thread retain dataflow (functional) properties, eliminating the need for such hardware. Results (data) flow from instruction to instruction, and each instruction specifies a location for its data to be stored. This is contrary to the control-flow model, where results are stored in locations with no specific connection to a destination instruction; the flow of data only defines control points. Our proposed Scheduled Dataflow system deviates from the data-driven (token-driven) execution traditionally used in "pure" dataflow systems: it is "instruction driven," using program-counter-style sequencing to execute instructions.

3.1 Overview of Scheduled Dataflow Architecture

We feel that it should be possible to define a dataflow architecture that executes instructions in a prescribed order instead of executing them as soon as their data is available.

Figure 2. Instruction Formats: (a) ETS instruction format (Opcode, Offset (R), Dest-Instr-1 and Port, Dest-Instr-2 and Port); (b) Scheduled Dataflow instruction format (Opcode, Offset (R), Dest-Data-1 and Port, Dest-Data-2 and Port).

Compile-time analysis of the source program can be used to define an expected order in which instructions will be executed (even when data is already available for those instructions).

3.2 Instruction Formats

Before describing the architecture of Scheduled Dataflow, it is necessary to understand the instruction format. Our architecture is derived from the Explicit Token Store dataflow system [13, 9], where each instruction specifies a memory location by providing an offset (R) with respect to an activation frame pointer (FP). The first data token destined for the instruction is stored in this memory location, waiting for its match. When a matching data token arrives, the previously stored data is retrieved and the instruction is immediately scheduled for execution. The result of the instruction is converted into (up to) two tokens by tagging the data with the address of its destination instruction (IP). The format of ETS instructions is shown in figure 2(a).

Instructions for Scheduled Dataflow differ from ETS instructions only slightly (figure 2(b)). In ETS, the destinations refer to the destination instructions (i.e., IP values); in Scheduled Dataflow the destinations refer to the operand locations of the destination instructions (i.e., offsets into activation frames or register contexts). This change also permits the detection of RAW data dependencies among instructions in the execution pipeline and the use of result forwarding, so that results from an instruction can be sent directly to dependent instructions. Result forwarding is not applicable in ETS dataflow since instructions are token driven. The source operands for Scheduled Dataflow are specified by a single offset (R) value and refer to a pair of registers where the data values were stored by predecessor instructions.
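As a concrete (and purely illustrative) reading of figure 2, the C declarations below contrast the two formats: in ETS a destination names an instruction (an IP), while in Scheduled Dataflow a destination names an operand slot, a register-pair offset plus a left/right port, in the consumer's context. The field names and widths are assumptions for exposition, not the encoding defined in [11].

```c
#include <stdint.h>

/* Which half of the destination's register pair receives the result. */
typedef enum { PORT_LEFT, PORT_RIGHT } port_t;

/* (a) ETS-style instruction: destinations are instruction pointers (IPs);
 * a result becomes a token <IP, data> that must be matched in frame memory. */
typedef struct {
    uint8_t  opcode;
    uint16_t offset_r;        /* operand slot in the activation frame (FP + R) */
    uint32_t dest_ip[2];      /* up to two destination *instructions* */
    port_t   dest_port[2];
} ets_instr_t;

/* (b) Scheduled Dataflow instruction: destinations are operand locations
 * (register-pair offsets) in the consumers' contexts, so a result can be
 * written, or forwarded, directly to where the consumer will read it. */
typedef struct {
    uint8_t  opcode;
    uint16_t offset_r;        /* source: register pair holding both operands */
    uint16_t dest_offset[2];  /* up to two destination *operand slots* */
    port_t   dest_port[2];    /* left or right operand of that pair */
} sdf_instr_t;
```

Because an SDF destination already names the slot the consumer will read, a result can be forwarded to that slot while still in the pipeline, which is the RAW-forwarding opportunity noted above.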

3.3 Pipeline Structure of Scheduled Dataflow

Figure 3 describes the architecture of Scheduled Dataflow (the figure does not show all the data paths for the Synchronization Processor). The functionality of the pipeline stages of our Execution Processor is described below.

Figure 3. General organization of the Scheduled Dataflow architecture: an Execution Pipeline (Instruction Fetch, Operand Fetch, Execute, Write Back) with instruction cache, operand cache, and register contexts, coupled to a Synchronization Processor with access to instruction and frame memory, I-structure memory, and the I-structure cache.

Instruction Fetch: The instruction-fetch stage behaves like a traditional fetch stage, relying on a program counter to fetch the next instruction. The context information can be viewed as part of the thread id and is used to access the register file specific to the thread.

Operand Fetch: The operand-fetch stage retrieves a double word from the register file containing the two operands of the instruction. Each instruction specifies an offset (R) that refers to a pair of registers (as described above); thus a read port must supply a double word.

Execute: The execute stage executes the instruction and sends the results to write-back along with the destination addresses (identifying the registers of the destination instructions).

Write-back: The write-back stage writes up to two values to the register file; the two values may go to two different locations in the register file. This necessitates two write ports to the register file.

As can be seen, the execution pipeline described above behaves very much like a conventional RISC pipeline while retaining the primary dataflow properties: functional nature, freedom from side-effects, and non-blocking threads. The functional and side-effect-free nature of dataflow eliminates the need for complex hardware (e.g., a scoreboard) for detecting write-after-read (WAR) and write-after-write (WAW) dependencies and for register renaming. In fact, the double-word operand memory (registers) can be viewed as implicit register renaming and a variation of the reservation stations used in Tomasulo's algorithm. In our architecture, scheduling and register renaming are implied by the programming model and hence defined statically. The non-blocking nature of our thread model and the use of a separate processor (the Synchronization Processor, SP) for thread synchronization and memory accesses (see section 3.4 for more details) eliminate unnecessary thread context switches on long-latency operations and cache misses. Our architecture does not prevent superscalar or multiple-instruction-issue implementations of the Execution Pipeline (EP).
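To make the stage descriptions concrete, the sketch below walks instructions through the four stages in software: fetch by program counter, a single double-word operand read from a register pair, execution, and a write-back of up to two results placed directly into the destination instructions' operand slots. It is a behavioral illustration only; the instruction encoding, field names, and toy ALU are assumptions for exposition (a simplified version of the hypothetical format sketched in section 3.2), not the EP hardware or the instruction set of [11].

```c
#include <stdint.h>
#include <stdio.h>

#define REG_PAIRS 32
typedef enum { PORT_LEFT, PORT_RIGHT } port_t;
typedef struct { double left, right; } reg_pair_t;
typedef struct {                        /* simplified SDF instruction (illustrative) */
    uint8_t  opcode;
    uint16_t offset_r;                  /* source register pair                       */
    uint16_t dest_offset[2];            /* destination operand slots (register pairs) */
    port_t   dest_port[2];
} sdf_instr_t;

typedef struct {
    unsigned   pc;                      /* program counter within the thread */
    reg_pair_t regs[REG_PAIRS];         /* per-thread register context       */
} thread_ctx_t;

static double alu(uint8_t opcode, double a, double b) {
    return opcode == 0 ? a + b : a * b;            /* toy ALU: 0 = add, 1 = mul */
}

static void ep_step(const sdf_instr_t *imem, thread_ctx_t *ctx) {
    const sdf_instr_t *i = &imem[ctx->pc++];       /* IF: fetch by program counter   */
    reg_pair_t ops = ctx->regs[i->offset_r];       /* OF: one double-word read       */
    double result = alu(i->opcode, ops.left, ops.right);   /* EX                     */
    for (int d = 0; d < 2; d++) {                  /* WB: up to two writes, directly */
        reg_pair_t *dst = &ctx->regs[i->dest_offset[d]];   /* into consumers' slots  */
        if (i->dest_port[d] == PORT_LEFT) dst->left = result;
        else                              dst->right = result;
    }
}

int main(void) {
    /* Toy thread: i0 adds the pair at offset 0 and deposits the sum into the
     * left operand of the pair at offset 1; i1 multiplies that pair. */
    sdf_instr_t prog[2] = {
        { .opcode = 0, .offset_r = 0, .dest_offset = {1, 3}, .dest_port = {PORT_LEFT, PORT_LEFT} },
        { .opcode = 1, .offset_r = 1, .dest_offset = {2, 3}, .dest_port = {PORT_LEFT, PORT_LEFT} },
    };
    thread_ctx_t ctx = { .pc = 0 };
    ctx.regs[0] = (reg_pair_t){ 3.0, 4.0 };        /* operands pre-loaded by the SP */
    ctx.regs[1].right = 10.0;
    ep_step(prog, &ctx);                           /* (3+4) -> regs[1].left */
    ep_step(prog, &ctx);                           /* 7*10  -> regs[2].left */
    printf("result = %g\n", ctx.regs[2].left);     /* 70 */
    return 0;
}
```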

3.4 Separate Synchronization Processor

Using multiple hardware units for the coordination and execution of instructions is not new. We have described three examples of decoupled architectures in section 2. There are other systems in which separate hardware units handle the synchronization among threads in multithreaded architectures (e.g., Alewife [1], StarT-NG [3], EARTH [7]). We follow this tradition and propose two hardware units for Scheduled Dataflow. One hardware unit (the EP) is similar to a conventional RISC pipeline, as described previously. The other hardware unit (the SP) is responsible for accessing memory to load the initial operands of enabled threads into registers (pre-load) and to store the results produced by threads from registers (post-store), for maintaining synchronization counts for threads, and for scheduling enabled threads (including allocating register contexts and placing enabled threads on the ready queue of the execution unit). We have developed a complete instruction set [11] for the Scheduled Dataflow architecture and hand-coded several example programs, including some Livermore Loops, matrix multiplication, Fibonacci, and factorial functions. We rely on this experience in generating appropriate parameters for the Monte Carlo simulations described in the next section.

4 Analytical Model Evaluating the New Architecture

4.1 Overview of the experiment

There have been many analytical formulations for predicting the performance of multithreaded programs on conventional architectures (see, for example, [2], [4]). In this section, we show a preliminary performance analysis of our Scheduled Dataflow architecture using Monte Carlo simulations. In order to analyze the architecture in a more realistic light, we generated synthetic workloads and applied these workloads to simulations representing the different architectures (a simplified illustration of such workload generation is sketched below). The workload generation is based on parameters chosen from published data (e.g., [8]), our observations based on specific architectural characteristics, and observations based on hand-coded programs. Our intention is to emphasize the fundamental differences in the programming and execution paradigms of the architectures (viz., data-driven threads, non-blocking vs. blocking threads, no stalls on memory accesses vs. stalls due to cache misses, no branch stalls vs. stalls on misprediction, token-driven vs. scheduled instructions, etc.). Thus, it is not possible to use the same set of parameters for all architectures or to come up with a single common metric. Instead, we used the same "normalized" workload for all architectures: all architectures execute the same amount of useful work, but different architectures incur different amounts of overhead instructions, stalls, and context switches.
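As a rough illustration of the kind of synthetic workload generation referred to above, the sketch below draws per-thread parameters (thread length, number of memory-access instructions, cache misses) from simple distributions. The specific distributions, constants, and use of rand() are placeholders for exposition; the actual workload models and parameters are described in [12].

```c
#include <stdio.h>
#include <stdlib.h>

/* One synthetic thread of the "normalized" workload (illustrative only). */
typedef struct {
    int useful_instrs;   /* functional instructions (same useful work for every architecture) */
    int mem_instrs;      /* loads/stores, a fixed fraction of the useful work                 */
    int cache_misses;    /* misses drawn from the assumed miss rate                           */
} synth_thread_t;

/* Placeholder parameters, not the ones used in the paper's experiments. */
#define AVG_THREAD_LEN  20      /* e.g., SDF threads averaged ~20 functional instructions */
#define MEM_FRACTION    0.30    /* fraction of load/store instructions                    */
#define MISS_RATE       0.05    /* 5% cache miss rate                                     */

static synth_thread_t make_thread(void) {
    synth_thread_t t;
    /* thread length uniform in [AVG/2, 3*AVG/2]; a real model would use a
     * measured or published distribution instead. */
    t.useful_instrs = AVG_THREAD_LEN / 2 + rand() % (AVG_THREAD_LEN + 1);
    t.mem_instrs    = (int)(t.useful_instrs * MEM_FRACTION + 0.5);
    t.cache_misses  = 0;
    for (int i = 0; i < t.mem_instrs; i++)
        if ((double)rand() / RAND_MAX < MISS_RATE) t.cache_misses++;
    return t;
}

int main(void) {
    srand(42);
    long total = 0, misses = 0;
    for (int i = 0; i < 10000; i++) {        /* Monte Carlo draw of 10,000 threads */
        synth_thread_t t = make_thread();
        total  += t.useful_instrs;
        misses += t.cache_misses;
    }
    printf("total useful work: %ld instrs, cache misses: %ld\n", total, misses);
    return 0;
}
```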

Figure 4. Effect of thread parallelism: execution time (K unit time) vs. number of concurrent threads (1-8) for the conventional processor, Rhamma, and SDF with L = 1R, 3R, and 5R.

A detailed explanation of the models used in our simulations of the three architectures (conventional processor, Scheduled Dataflow processor, and Rhamma processor) can be found in [12].

4.2 Thread parallelism

In order to measure the effect of thread-level parallelism on the performance of the different architectures, we generated a sequence of threads for each architecture. We took the simple performance model for multithreaded processors suggested by Agarwal [2] to introduce a latency between pairs of threads (the time difference between the termination of a thread and the initiation of a successive thread). We considered latencies of 1, 3, and 5 times the length of a thread (L = 1R, L = 3R, L = 5R in figure 4). Note that figure 4 shows the execution times for the same total workload while varying the number of threads comprising the workload. The execution time of the conventional processor is not affected by the degree of thread parallelism, since this processor executes single-threaded programs only. However, as the degree of thread parallelism increases, both Scheduled Dataflow and Rhamma show performance gains. As expected, with only one thread at a time (degree of thread parallelism = 1), the multithreaded architectures perform poorly compared to the single-threaded architecture. The figure also shows that Scheduled Dataflow executes the multithreaded workload faster than Rhamma for all values of thread parallelism.

In the above experiment we used the same cache miss rate (5%) and the same cache miss penalty (50 cycles) for all architectures. We feel, however, that Scheduled Dataflow would have lower cache miss rates, since the pre-loads and post-stores of thread data facilitate better cache prefetching than in other architectures, as well as better data grouping and placement by the compiler. It should also be mentioned that Scheduled Dataflow will provide a higher degree of thread parallelism than Rhamma, since the non-blocking nature of Scheduled Dataflow leads to finer-grained threads. These two observations indicate that we can expect even better performance for Scheduled Dataflow than shown in figure 4. In the remaining experiments we use L = 3R for both the Scheduled Dataflow and Rhamma architectures.
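For intuition about why adding threads helps and then saturates, the following is a simplified utilization model in the spirit of [2] (our own simplification, not the model used in our simulator): with run length R, inter-thread latency L, and N concurrent threads, processor utilization is roughly min(1, N*R/(R+L)), saturating once N >= 1 + L/R, i.e., around four threads for L = 3R.

```c
#include <stdio.h>

/* Simplified multithreading utilization model (for intuition only; not the
 * simulator's model): each thread runs for R time units, then waits L time
 * units before its successor can start. */
static double utilization(int n_threads, double R, double L) {
    double u = n_threads * R / (R + L);
    return u > 1.0 ? 1.0 : u;
}

int main(void) {
    const double R = 1.0;
    const double latencies[] = {1.0, 3.0, 5.0};           /* L = 1R, 3R, 5R */
    for (int li = 0; li < 3; li++) {
        printf("L = %.0fR:", latencies[li]);
        for (int n = 1; n <= 8; n++)                       /* 1..8 concurrent threads */
            printf("  %.2f", utilization(n, R, latencies[li]));
        printf("\n");
    }
    return 0;
}
```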

4.3 Thread granularity

In the previous experiments we set the average thread length to 30, 20, and 50 functional instructions for the conventional architecture, Scheduled Dataflow, and Rhamma, respectively. These average thread lengths are based on our observations from analyzing actual programs written using Scheduled Dataflow instructions. In this section we vary the average thread lengths; the results are shown in figure 5. Note that the normalized thread length includes only functional instructions and excludes architecture-specific overhead instructions. For the conventional and Scheduled Dataflow architectures, increasing thread run-lengths yields performance gains up to a point, since longer threads imply fewer context switches. With Rhamma, however, longer threads do not guarantee shorter execution times: the blocking nature of Rhamma threads causes proportionally more thread blockings (context switches) per thread as the run length increases. Thus, increasing thread granularity without other optimizations may adversely impact the performance of blocking multithreaded systems with decoupled access/execute processors.

Figure 5. Effect of thread length: execution time (K unit time) vs. normalized thread length (10-70) for the conventional processor, Rhamma, and SDF.

4.4 Fraction of memory access instructions

Since both Scheduled Dataflow and Rhamma decouple memory accesses from pipeline execution, we explored the impact of the number of memory access instructions per thread. Figure 6 shows the results, where the x-axis indicates the fraction of load/store instructions.

Figure 6. Effect of the fraction of memory access instructions: execution time (K unit time) vs. fraction of load/store instructions (0.25-0.5) for the conventional processor, Rhamma, and SDF.

For the conventional architecture, increasing the number of memory access instructions leads to more cache misses and thus longer execution times. The decoupling, however, permits the two multithreaded processors to tolerate the cache miss penalties. Note that Scheduled Dataflow outperforms Rhamma for all fractions of memory access instructions, primarily because of the "pre-loading" and "post-storing" performed by Scheduled Dataflow. We feel that decoupling memory accesses from execution is more useful when the memory accesses can be grouped together (as is done in Scheduled Dataflow).

4.5 Effect of Cache memories

Figure 7 shows the effect of cache memories on the performance of the three architectures. We assumed a 50-cycle cache miss penalty for figure 7(a) and a 5% cache miss rate for figure 7(b). As observed in the previous section, both multithreaded processors are less sensitive to memory access delays than the conventional processor. When a cache miss occurs in Rhamma, a context switch ("switch on use") of the faulting thread occurs. In Scheduled Dataflow, only pre-load and post-store operations access memory; assuming non-blocking caches, a cache miss does not prevent memory accesses on behalf of other threads. This is not possible in Rhamma, since memory accesses are not separated into "pre-loads" and "post-stores." The delays incurred by pre-loads and post-stores in Scheduled Dataflow do not lead to additional context switches, since threads are enabled for execution only when pre-loading is complete, and once enabled they complete without blocking. Once again, we feel that decoupled memory accesses provide better tolerance of memory latencies when used with non-blocking multithreading models and when memory accesses are grouped into pre-loads and post-stores.

Figure 7. Effect of cache memories: (a) execution time (K unit time) vs. cache miss rate (0-10%); (b) execution time (K unit time) vs. cache miss penalty (0-100 cycles), for the conventional processor, Rhamma, and SDF.

5 Conclusions

In this paper we presented a dataflow architecture that utilizes control-flow-like scheduling of instructions and separates memory accesses from instruction execution to tolerate the long latencies incurred by memory accesses. Our primary goal is to show that it is possible to design efficient multithreaded dataflow implementations. While decoupled access/execute implementations are possible with single-threaded architectures, a multithreading model presents better opportunities for exploiting the decoupling of memory accesses from the execution pipeline. We feel that, even among multithreaded alternatives, non-blocking models are better suited for decoupled execution. Furthermore, grouping memory accesses (e.g., pre-load and post-store) for threads eliminates unnecessary delays (stalls) caused by memory accesses. We strongly favor the use of dataflow instructions to reduce the complexity of the processor by eliminating the complex logic needed for resolving data dependencies, branch prediction, register renaming, and instruction scheduling in superscalar implementations. Although the results presented here are based on synthetic benchmarks and Monte Carlo simulations, the benchmarks are driven by published data (e.g., load/store instruction frequencies, branch frequencies, cache miss rates and penalties) and by information obtained from analyzing several programs written for the architectures under evaluation. We are currently developing detailed instruction-level simulations of the proposed Scheduled Dataflow architecture to investigate its performance based on instruction traces.

References

[1] A. Agarwal, et al., "The MIT Alewife machine: Architecture and performance," Proc. of 22nd Int'l Symp. on Computer Architecture (ISCA-22), 1995, pp. 2–13.

[2] A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp. 525–539, September 1992.

[3] D. Chiou, et al., "StarT-NG: Delivering seamless parallel computing," Proc. of the First Int'l EURO-PAR Conference, Aug. 1995, pp. 101–116.

[4] D.E. Culler, "Multithreading: Fundamental limits, potential gains and alternatives," Proc. of Supercomputing 91, Workshop on Multithreading, 1992.

[5] R. Govindarajan, S.S. Nemawarkar and P. LeNir, "Design and performance evaluation of a multithreaded architecture," Proc. of HPCA-1, Jan. 1995, pp. 298–307.

[6] W. Grunewald and T. Ungerer, "A multithreaded processor design for distributed shared memory systems," Proc. Int'l Conf. on Advances in Parallel and Distributed Computing, 1997.

[7] H.H.-J. Hum, et al., "A design study of the EARTH multiprocessor," Proc. of the Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June 1995, pp. 59–68.

[8] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1996, p. 105.

[9] K.M. Kavi and A.R. Hurson, "Performance of cache memories in dataflow architectures," Euromicro Journal on Systems Architecture, June 1998, pp. 657–674.

[10] K.M. Kavi, D. Levine and A.R. Hurson, "PL/PS: A non-blocking multithreaded architecture," Proc. of the Fifth International Conference on Advanced Computing (ADCOMP '97), Madras, India, Dec. 1997.

[11] H.-S. Kim, "Instruction Set Architecture of Scheduled Dataflow," Technical Report, Dept. of Electrical and Computer Engineering, University of Alabama in Huntsville, April 1998.

[12] H.-S. Kim and K.M. Kavi, "Preliminary Performance Analysis on Decoupled Multithreaded Architectures," Technical Report, Dept. of Electrical and Computer Engineering, University of Alabama in Huntsville, October 1998.

[13] G.M. Papadopoulos and K.R. Traub, "Multithreading: A Revisionist View of Dataflow Architectures," Proc. of the 18th International Symposium on Computer Architecture, pp. 342–351.

[14] S. Sakai, et al., "Super-threading: Architectural and software mechanisms for optimizing parallel computations," Proc. of 1993 Int'l Conference on Supercomputing, July 1993, pp. 251–260.

[15] J.E. Smith, "Decoupled Access/Execute Computer Architectures," Proc. of the 9th Annual Symp. on Computer Architecture, May 1982, pp. 112–119.

[16] M. Takesue, "A unified resource management and execution control mechanism for dataflow machines," Proc. of 14th Int'l Symp. on Computer Architecture, June 1987, pp. 90–97.

[17] S.A. Thoreson and A.N. Long, "A feasibility study of a memory hierarchy in a data flow environment," Proc. of Int'l Conference on Parallel Processing, June 1987, pp. 356–360.

[18] M. Tokoro, J.R. Jagannathan and H. Sunahara, "On the working set concept for data-flow machines," Proc. of 10th Int'l Symp. on Computer Architecture, July 1983, pp. 90–97.