Design Space Exploration of Multiprocessor Systems with MultiContext Reconfigurable Co-Processors

Pranav Vaidya and Jaehwan John Lee
ECE Department, Purdue School of Engineering and Technology
Indiana University-Purdue University Indianapolis, Indiana, USA

Abstract— Future high performance computing systems may consist of multiple processors and reconfigurable logic co-processors. As indicated by industry trends, such co-processors will be integrated on existing motherboards without any glue logic. It is likely that such hybrid computing machines will be a breakthrough for various high performance applications. As a result, it has become essential to investigate the system architectures of such machines. This paper describes a full-system simulation approach to model and evaluate hybrid computing systems made up of multiple processors and coarse-grained reconfigurable logic co-processors. We develop a full-system simulator for such hybrid machines by extending an existing full-system simulator with device models for multicontext coarse-grained reconfigurable logic co-processors. The proposed full-system simulator is able to execute an unmodified multiprocessor operating system and a multiband filtering application. Using this full-system simulation approach, we have investigated the tradeoffs among various system architectures.

Index Terms— Multiprocessor, Reconfigurable Logic, Full-System Simulation, Multicontext, FPGA, Coarse-grained

I. INTRODUCTION

In recent years, two major trends have become evident in the computing industry. Firstly, major microprocessor manufacturers are integrating multiple processors/processor cores in a single chip. Secondly, reconfigurable devices are prominently being used for application acceleration. It is likely that these two trends will merge. In the near future, hybrid computing machines made up of several processors and reconfigurable logic co-processors may become commonplace. An indication of this trend is the commodity multiprocessor server platform containing multiple processor cores and reconfigurable co-processors [1]–[3]. These machines offer high performance computation beyond the limitations of Von Neumann machines. It is therefore essential to investigate the system architectures of such hybrid machines and understand any associated issues.

One of the most important issues in using reconfigurable co-processors is the cost and the time taken to configure these reconfigurable devices. Commodity reconfigurable devices, namely Field Programmable Gate Arrays (FPGAs), are fine-grained devices. They are made up of a large number of small, simple Configurable Logic Blocks (CLBs). As a result, a large amount of configuration data is required to configure these devices. As configuration time is proportional to the size of the configuration data, FPGAs have relatively long configuration times

running in the range of several milliseconds. Applications that successfully utilize FPGAs amortize this large cost of configuration by mapping rather long-running functions to the FPGAs. Scientific applications have validated this approach and obtained performance gains on the order of 10-100X over software implementations [4]. However, the cost of configuration prevents fine-grained FPGAs from being used for a large number of applications that contain many medium and short running functions.

Coarse-grained FPGAs [5] have shorter configuration times and provide an alternative to fine-grained FPGAs for applications that contain many medium and short running functions. Coarse-grained FPGAs are made up of a small number of ALU-like functional units. The data required to configure a smaller number of functional units is small, and thus, the configuration time for coarse-grained FPGAs is shorter than that for fine-grained FPGAs. Moreover, the cost of configuration can be hidden from applications by using multicontext coarse-grained FPGAs. Multicontext FPGAs [6] hold a set of device configuration data, or contexts, in their context memory. A context represents the configuration data required to configure the coarse-grained FPGA to perform one specific hardware function. Coarse-grained FPGAs can switch between these hardware contexts in essentially one clock cycle to perform multi-tasking at the FPGA hardware level. Furthermore, other contexts for the coarse-grained FPGA can be loaded while the coarse-grained FPGA is performing computation with the active context. Thus, the configuration cost of the multicontext coarse-grained FPGA can be hidden from applications. Multicontext coarse-grained FPGAs are promising co-processors for future high performance computing machines that execute many medium and short running functions. As integration of fine-grained FPGAs on existing motherboards has already been demonstrated in the commodity multiprocessor platforms from [1]–[3], it is also possible to follow a similar approach to integrate coarse-grained FPGAs on existing motherboards to create multiprocessor computing systems with several multicontext coarse-grained FPGA co-processors.

Applications can be accelerated using these multicontext coarse-grained FPGAs as a customizable data-path engine. Incorporating large data-paths into a single context of the FPGA may be constrained by limited hardware resources. Multicontext coarse-grained FPGAs overcome this limitation by partitioning larger data-paths into multiple contexts. Each context then executes a section of the larger data-path.

Applications can also use the multicontext coarse-grained FPGA as a time-multiplexed shared hardware resource where functions from several applications are loaded into the device, and then multi-tasking at the FPGA hardware level is performed by switching between the contexts. This paper utilizes both these approaches by mapping four large FIR filters onto the multicontext coarse-grained co-processors. To overcome the limitation of hardware resources required by the FIR filter, each FIR filter is further partitioned into two filter stages that are each mapped to a single configuration context. The multicontext coarse-grained FPGA then performs filtering by switching between contexts that represent the filter stages.

Our investigation in this paper is based on two important assumptions:
1) Hybrid computing machines consisting of multiple processors and reconfigurable logic co-processors may become commonplace in the future.
2) Multicontext coarse-grained FPGAs can be used as co-processors for application acceleration in these hybrid computing machines.

Based on these assumptions, we feel that it is necessary to evaluate the architectures of hybrid computing systems that contain multiple processors and multiple multicontext coarse-grained FPGAs. As any investigation into system architectures is an exercise in design space exploration, we evaluate the systems based on the two parameters given below:
1) Multicontext coarse-grained FPGAs versus single-context coarse-grained FPGAs
2) Shared coarse-grained co-processors versus dedicated coarse-grained co-processors

In this paper, we assume the following limitations. Firstly, we evaluate the performance of statically partitioned applications. We do not consider applications where corresponding software versions of the hardware functions are available and executed in case reconfigurable co-processors are busy. Secondly, we use functional models of traditional processors. In this case, we assume that each instruction is executed in one cycle.

In the past, similar investigations have evaluated uniprocessor systems with single reconfigurable logic co-processors. Our approach differs from the previous approaches as we evaluate an architecture with multiple processors and multiple reconfigurable co-processors. Using the proposed full-system simulator, we have modeled four PowerPC processors and up to four Reconfigurable co-processor Units (RUs), each capable of holding one, two or eight configuration contexts in the context memory. Most prior approaches have not considered the Operating System (OS) as part of the executing workload. As our simulator is able to run the Atalanta multiprocessor Real-Time Operating System (RTOS) [7] (described later in the paper), we model the workload of the system in terms of both the OS and the application. Moreover, most previous approaches have modified the instruction set of the processors with instructions

to configure and use the co-processors. This might not be possible for real-world commodity processors. As opposed to these approaches, we provide simple functions that act as device drivers for the co-processors. Our approach is similar to that used in hybrid systems currently available in the market [1]–[3].

II. PREVIOUS WORK

In this section, we outline the previous research work in this field.

A. Approaches to Minimize the Cost of Configuring an FPGA

The improvements suggested by the research community to reduce configuration time may be classified into two categories: (i) reducing the amount of configuration data and (ii) making configuration time transparent to the applications. Partial reconfiguration [8], configuration compression [9] and coarse-grained FPGAs [5] can be classified under category (i). Configuration prefetching [10] and multicontext FPGAs [6] can be classified under category (ii).

Partial reconfiguration is the configuration of a small portion of the device while other portions of the device remain operational. This results in smaller configuration data targeted towards the smaller portion of the device that is being configured. Configuration compression employs lossless data compression techniques to reduce the amount of configuration data. This compressed configuration data is then decompressed on-the-fly during the actual configuration process. Configuration prefetching utilizes an attached processor to download the next configuration data as early as possible to the FPGA. As a result, the configuration data download is made transparent to a program running on the attached processor. Multicontext FPGAs and coarse-grained FPGAs have been described in Section I. Multicontext coarse-grained FPGAs combine the advantages of both these approaches: they reduce the configuration time of the FPGA and make such configuration transparent to the applications.

B. Multicontext Reconfigurable Devices

Multicontext FPGAs can be both fine-grained as well as coarse-grained. In the area of fine-grained FPGAs, Trimberger et al. [11] suggested multicontext extensions to the Xilinx XC4000E. DeHon proposed the Dynamically Programmable Gate Array (DPGA) architecture [12] and outlined a way to couple this co-processor unit with a microprocessor. Similarly, PipeRench [13] used a multicontext array of coarse-grained processing elements. The Garp processor [14] was one of the first system architectures to extend a single processor with a reconfigurable co-processor. Current fine-grained FPGAs available from Xilinx [15] contain multiple RISC processors within the FPGA fabric. Devices from Chameleon Systems such as the CS2000 couple a coarse-grained FPGA with a traditional processor.

C. Simulation and Co-Simulation Research Work

Traditionally, simulation and co-simulation approaches have been employed to model hybrid architectures. OneChip [16] investigated the integration of a reconfigurable co-processor within the pipeline of a RISC processor. Similarly, Platzner et al. [17] investigated hybrid systems containing a single multicontext coarse-grained FPGA co-processor and a single processor. Both these approaches used SimpleScalar [18] to model the processor. The reconfigurable co-processor was not modeled explicitly in OneChip. Instead, the functionality and execution latency of a particular configuration were used to model the co-processor. Platzner et al. developed the reconfigurable co-processor model explicitly in VHDL and coupled it with SimpleScalar to provide cycle-accurate simulation. Co-processor coupling approaches such as the one investigated by Platzner et al. have extended the instruction set of the processor with specific instructions to program and control the reconfigurable logic co-processor. This may not be feasible with existing commercial processors. Some researchers have used full-system simulators such as Simics [19] with co-simulation environments such as ModelSim [20] and the Seamless Co-Verification Environment [21] to model uniprocessor systems coupled with a single reconfigurable co-processor [22].

All of the above approaches have evaluated a single processor system with a single reconfigurable co-processor. However, industry trends indicate that it will be possible to integrate multiple reconfigurable co-processors on existing multiprocessor motherboards [1]–[3]. It is quite likely that in the near future, multiprocessor motherboards will contain several processors and reconfigurable logic co-processors. However, there has been no investigation into how multiple reconfigurable co-processors should be used in a multiprocessor system. Furthermore, there is no full-system simulation environment to model such hybrid computing machines made up of multiple processors and reconfigurable logic co-processors. In this paper, we present a full-system simulator for such hybrid machines with multiple processors and multicontext coarse-grained reconfigurable co-processors. Then, we investigate whether application performance can be improved by adding hardware in the form of shared multicontext co-processors with more contexts or dedicated multicontext co-processors with fewer contexts.

III. OVERVIEW OF THE SYSTEM ARCHITECTURE EXPLORED

In this section, we first explain in detail the architecture of the coarse-grained RU model and the multiprocessor model. We also illustrate the coupling of RU co-processors with the multiprocessor model. We then present an overview of the architectures explored. Finally, we explain the Atalanta RTOS [7] that was tested on these architectures and describe the driver functions used by applications to configure and use these co-processors. These driver functions could also be provided as part of OS device drivers.

Fig. 1. Reconfigurable co-processor Unit architecture (four FIFOs, a Reconfigurable Logic Controller (RLC), a computation engine, context memory, command and cycle registers, and a cross-bar interconnect).

A. Coarse-grained Reconfigurable Co-Processor Unit Model

Fig. 1 shows the overall architecture of the Reconfigurable co-processor Unit (RU). Our RU model is similar to the RU implemented in [17] and [25]. In contrast to the model proposed in [17], our RU model consists of four FIFOs, a Reconfigurable Logic Controller (RLC), a computation engine and a multicontext memory. The FIFOs are dual-ported, with one port on the computation engine side and one on the RLC side. The FIFOs store the input and output application data values used by the RU computation engine. The size of each FIFO is 512 x 16 bits. The number of configuration contexts that the RU can hold in the context memory can be varied from one to eight. The computation engine consists of a homogeneous two-dimensional array of ALUs. The size of this two-dimensional array is customizable. However, it should be noted that a larger computation engine requires larger configuration data. In our research, we have used identical computation engines in each of the system architectures. The ALUs are interconnected with reconfigurable interconnects to provide custom data paths. CPUs do not directly access the computation engine within the RU. Instead, the RLC controls the computation engine according to the commands issued by the CPUs. Similar to the model proposed in [17], our RU model can be customized by varying the following parameters: the datapath size (8 or 16 bits), the size of the FIFO buffers, the size of the computation engine, and the number of configuration contexts stored in the context memory. In our research, we have varied the number of contexts in the RU co-processor model to obtain the RU variants used in system architectures B, C, D, E and F.

1) Overview of the Computation Engine: The RU computation engine contains an array of 4 x 4 cells, each made up of a processing part and a reconfigurable interconnection part. This computation engine is similar to the coarse-grained architectures investigated in [17] and [25]. Fig. 2 outlines the processing core of each cell, which constitutes the computation engine shown in Fig. 3. Each cell consists of a floating point Arithmetic and Logic Unit (ALU), datapath multiplexors, input registers and output registers.

Fig. 2. Cell structure.

Fig. 3. Structure of the computation engine.
The control signals for the ALU and the multiplexors are part of the configuration data. A special register called const contains a constant operand which can be routed to both the ALU inputs. The value of the const register is also part of the configuration context. Applications can use this register to specify a constant operand for computation. In our research, we have used this register to specify the filter coefficients of our filters. Both the ALU inputs and the ALU output can be registered. Furthermore, either the registered output or the unregistered output can be fed back to the ALU inputs. The ALU implements common arithmetic operations such as addition, subtraction and multiplication, and bit-wise logic operations such as AND, OR, NOT and NOR.

Fig. 3 illustrates the structure of a computation engine and shows the reconfigurable inter-connections among the cells. Each cell is directly connected to the outputs of its north, north-east, north-west, west and south neighbors. Similarly, each cell is directly connected to the inputs of its north, east, south, south-west and south-east neighbors. This interconnect scheme is cyclically continued across the array borders. Furthermore, cells on the leftmost column and the top row are directly connected to input ports IP0 and IP1. Similarly, cells on the rightmost column and the bottom row are directly connected to output ports OP0 and OP1. Each cell contains input multiplexors (Mux1 and Mux2 as shown in Fig. 2) that determine how the inputs of the cell are connected to the outputs of its neighbors. The control signals for these input multiplexors are specified as part of the configuration data. For a datapath size of 16 bits, the size of configuration data per context for a computation engine is: 33 bits per cell * 16 cells (the 4x4 cell array) + 84 bits per port controller * 2 port controllers (one for input and one for output) = 528 + 168 = 696 bits = 87 bytes.

2) Reconfigurable Logic Controller (RLC): The RLC is the main controller for the RU and contains four main


registers, namely (a) a cycle register, (b) a cpuID register, (c) the FIFOowner registers and (d) a command register. The cycle register is two bytes long and is used by the master processor to set the number of cycles for which the RU can run. With each cycle, the RLC decrements the value stored in its cycle register. The cycle register can be considered a status register for RU computation: a non-zero value in the cycle register implies that the RU is performing computation. The CPU synchronizes with the RU by continuously polling this register. Similar synchronization approaches were used in [17] and the Garp processor [14]. It should be noted that interrupt handshaking mechanisms between the CPU and the RU may be advantageous as opposed to polling. However, we do not implement this approach in our system and postpone the discussion of using interrupts to perform asynchronous communication between CPUs and RUs to our future work.

The RU is configured using the 32-bit command register in the RLC. The command register is used to specify the commands performed by the RU, such as configuration of computation engine contexts and switching of contexts. These commands are described in Table I, which shows the command register format.

3) Coupling Between the RUs and the CPUs: The CPUs are interconnected to the RUs using a crossbar switch. The RU devices are memory-mapped in the CPU memory space. The crossbar switch translates memory read/write requests to the RU address space into control signals for the RLC. All other memory read/write requests are forwarded to the memory controller. This approach is similar to the coupling approach between processors and reconfigurable co-processors in [2] and [3]. In our architecture, the RU is assigned to a master CPU during initialization in the dedicated mode. As a result, only the master CPU can read/write data values to the RU FIFOs.

TABLE I
COMMAND REGISTER FORMAT

The 32-bit command register consists of three fields: Command (4 bits), Context Number (4 bits) and Value (3 bytes).

Command (4 bits):
0x1: Request the RLC to write all configuration contexts. The RLC will only allow all configuration contexts to be overwritten if the computation engine is not performing any computation. When a command = 0x1 is received, all bytes received in the Value field (described below) will be treated as configuration data for all contexts. The RLC will load the configurations of each of the contexts serially, until it receives the special 0xA command.
0x2: Request the RLC to write a specific configuration context. When a command = 0x2 is received, all bytes received by the RU for this context are considered configuration data until the RLC receives the special 0xA command.
0x3: Switch the active context to the context number specified by the Context Number field (described below).
0x4: Reset the FIFO head of a given context (specified in the Context Number field) to a given value (specified in the Value field).
0x5: Reset the FIFO tail of a given context to a given value.
0xA: Indicates the end of configuration data.
0xF: Configure the RU device to run a specified number of cycles. The cycles are specified in the lower two bytes of the Value field.

Context Number (4 bits): Commands 0x2, 0x3, 0x4 and 0x5 use this field to specify the context number. For example, command 0x3 is a command to switch the RU from the current active context to the context specified in this field.

Value (3 bytes): This field is used to specify the configuration data, the value of the FIFO head/tail index pointers, the value written to the FIFO, or the number of cycles, depending on the Command field.
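To make the layout in Table I concrete, the sketch below packs the three fields into a 32-bit command word in C, the language our device models are written in. The exact bit positions are our illustrative assumption, since Table I fixes the field widths but not the bit ordering.

#include <stdint.h>

/* Command field values from Table I. */
enum rlc_command {
    RLC_CMD_WRITE_ALL_CTX = 0x1,  /* stream configuration data for all contexts */
    RLC_CMD_WRITE_ONE_CTX = 0x2,  /* stream configuration data for one context  */
    RLC_CMD_SWITCH_CTX    = 0x3,  /* switch the active context                  */
    RLC_CMD_SET_FIFO_HEAD = 0x4,  /* reset a context's FIFO head index          */
    RLC_CMD_SET_FIFO_TAIL = 0x5,  /* reset a context's FIFO tail index          */
    RLC_CMD_END_OF_CONFIG = 0xA,  /* marks the end of configuration data        */
    RLC_CMD_RUN_CYCLES    = 0xF   /* run for N cycles (lower two Value bytes)   */
};

/* Pack Command (4 bits), Context Number (4 bits) and Value (3 bytes). */
static inline uint32_t rlc_pack(enum rlc_command cmd, unsigned ctx,
                                uint32_t value)
{
    return (((uint32_t)cmd & 0xFu) << 28) |
           (((uint32_t)ctx & 0xFu) << 24) |
           (value & 0x00FFFFFFu);
}

/* Example: switch the active context to context 1.      */
/* uint32_t word = rlc_pack(RLC_CMD_SWITCH_CTX, 1, 0);   */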

Furthermore, only this CPU can load the RU configuration contexts in the dedicated mode. The RLC determines this by storing the CPUID of the master processor in an 8-bit cpuID register. In the shared mode, where several CPUs share one RU, each FIFO is exclusively allocated to one CPU. Only this CPU can read/write to this FIFO. The CPUID of the CPU to which a given FIFO has been allocated is stored in four 8-bit registers corresponding to the four FIFOs. These registers are called the FIFOowner registers. The computation engine reads/writes to one of the FIFOs as specified within the configuration data. It is possible for two or more configuration contexts to read/write to the same FIFO. This is what we have done in our experiments, where cascaded filter stages read/write to the same FIFO. Each configuration context maintains its own set of FIFO index registers/pointers. There are two index pointers per FIFO, representing the head and tail of the FIFO. Each input port reads a sample value from the FIFO head in one clock cycle and increments its index pointer to the next FIFO location. Similarly, each output port writes the computed result value to a specified FIFO tail location and increments its index pointer to the next FIFO location. The FIFO index pointers cyclically wrap around from the end of the FIFO to the start of the FIFO.
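Since our RU device models are written in C (Section III-E), the per-context FIFO indexing described above can be illustrated with the following sketch. The structure layout and all names here are illustrative stand-ins, not the actual simulator source; only the sizes (four 512 x 16-bit FIFOs, up to eight contexts, 87 bytes of configuration data per context) come from the text.

#include <stdint.h>

#define RU_NUM_FIFOS    4     /* dual-ported FIFOs on the RU             */
#define RU_FIFO_DEPTH   512   /* each FIFO is 512 x 16 bits              */
#define RU_MAX_CONTEXTS 8     /* context memory holds up to 8 contexts   */
#define RU_CTX_BYTES    87    /* configuration data per context          */

/* Each context keeps its own head/tail index pointers per FIFO. */
struct ru_context {
    uint8_t  config[RU_CTX_BYTES];
    uint16_t fifo_head[RU_NUM_FIFOS];
    uint16_t fifo_tail[RU_NUM_FIFOS];
};

struct ru_device {
    uint16_t fifo[RU_NUM_FIFOS][RU_FIFO_DEPTH]; /* application data       */
    struct ru_context ctx[RU_MAX_CONTEXTS];     /* multicontext memory    */
    int      active_ctx;                        /* context driving engine */
    uint16_t cycle_reg;                         /* RLC cycle register     */
    uint8_t  cpuid_reg;                         /* master CPU (dedicated) */
    uint8_t  fifo_owner[RU_NUM_FIFOS];          /* owning CPU per FIFO    */
    uint32_t command_reg;                       /* 32-bit command register*/
};

/* An input port reads one sample at the context's head index and
 * advances it; the index wraps cyclically to the start of the FIFO. */
static uint16_t fifo_pop(struct ru_device *ru, int ctx, int f)
{
    struct ru_context *c = &ru->ctx[ctx];
    uint16_t sample = ru->fifo[f][c->fifo_head[f]];
    c->fifo_head[f] = (uint16_t)((c->fifo_head[f] + 1) % RU_FIFO_DEPTH);
    return sample;
}

/* An output port writes one result at the context's tail index and
 * advances it, with the same cyclic wrap-around. */
static void fifo_push(struct ru_device *ru, int ctx, int f, uint16_t v)
{
    struct ru_context *c = &ru->ctx[ctx];
    ru->fifo[f][c->fifo_tail[f]] = v;
    c->fifo_tail[f] = (uint16_t)((c->fifo_tail[f] + 1) % RU_FIFO_DEPTH);
}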

B. Microprocessor Model for MPC750

We utilize the GxEmul instruction set simulator [24] to model our MPC750 multiprocessor system. Since GxEmul is not a cycle-accurate simulator, we assume that each instruction is executed by the processor in one clock cycle. This is along the lines of the argument put forward by popular full-system simulators such as Simics [19]. GxEmul can be customized by specifying the number of CPUs in the simulated machine. We utilized the existing GxEmul models for multiple processors, which are modeled functionally using the C language. Furthermore, we ported an existing symmetric multiprocessing RTOS (further described in Section III-D) to the target architectures. We decided to use GxEmul over other popular research simulators such as Simics [19] for two reasons. Currently, multiprocessor configurations for the MPC750 are not available in Simics. Moreover, it was relatively easy to port and test a symmetric multiprocessor operating system such as the Atalanta RTOS [7] on GxEmul.

C. System Architectures Explored

In this section, we explore the design space of the system architectures by varying two parameters, namely (a) the sharing of the RUs and (b) the number of contexts of the RU. As a result, we obtain the six configurations shown in Table II. With these system architectures, we are able to investigate the trade-offs between employing shared multicontext co-processors with more contexts versus dedicated multicontext co-processors with a fewer number of contexts. In this paper, we evaluate the performance of a statically partitioned application on the investigated architectures. Furthermore, we compare the performance of the application on these architectures with its performance on a multiprocessor software-only system.

1) System Architecture A - 4 Processors, No RU: As shown in Fig. 4, system architecture A consists of four 32-bit MPC750 processors interconnected with a simple crossbar device. This crossbar device is responsible for (a) controlling Inter-Processor Interrupts (IPI), (b) generating control signals for each CPU such as halt and sleep (for simulation purposes), (c) generating control signals for the RUs and (d) forwarding memory read/write requests to the RUs and the memory controller. A similar crossbar concept has been applied to the AMD multicore processors [23]. The system has 256MB of main memory, a simple console for user I/O and a Real-Time Clock (RTC). The RTC is used to retrieve the current time and issue periodic interrupts. This architecture is used as a software-only base for comparison with system architectures B, C, D, E and F. It should be noted that this architecture does not model any real-world system. However, such an architecture adequately represents modern real-world systems that have been extended with reconfigurable co-processors [2].

TABLE II
TYPES OF EXPERIMENTAL SETUPS EVALUATED

                 | Multicontext                | Single-context
Shared           | 4 CPUs, 1 RU,               | 4 CPUs, 1 RU,
                 | 8 or 2 contexts/RU          | 1 context/RU
                 | (Architecture B or C)       | (Architecture D)
Dedicated        | 4 CPUs, 4 RUs,              | 4 CPUs, 4 RUs,
                 | 2 contexts/RU               | 1 context/RU
                 | (Architecture E)            | (Architecture F)
Comparison base  | 4 CPUs without RU (Architecture A)



Fig. 4. Architecture A.

2) System Architecture B - 4 Processors, 1 Shared RU with 8 Contexts: As shown in Fig. 5, system architecture B consists of a single RU interconnected with four MPC750 processors using a crossbar. The crossbar device allows one CPU to access the memory while another is accessing the RU. The RU is capable of holding eight configuration contexts. This architecture represents a shared multicontext co-processor with a large number of contexts.


5) System Architecture E - 4 Processors, 4 Dedicated RUs with 2 Contexts: As shown in Fig. 8, system architecture E provides a dedicated RU co-processor for each MPC750 processor. The RUs are coupled to the processors using the same crossbar interconnect. The RU architecture in this system is similar to that of the RU in system architecture C. However, system architectures C and E are different in that they use shared and dedicated RU co-processors, respectively.

Fig. 5. Architecture B.

3) System Architecture C - 4 Processors, 1 Shared RU with 2 Contexts: As shown in Fig. 6, system architecture C consists of a single RU interconnected with four MPC750 processors using a crossbar. This architecture is similar to system architecture B except that the RU can hold two configuration contexts instead of eight. This system architecture models an RU with limited hardware resources for holding multiple contexts.

Fig. 8. Architecture E.

6) System Architecture F - 4 Processors, 4 Dedicated RUs with 1 Context: As shown in Fig. 9, system architecture F provides a dedicated RU co-processor for each MPC750 processor. The RUs are coupled to the processors using the same crossbar interconnect. This configuration is similar to system architecture E except that each RU can hold only one configuration context, i.e., the active configuration context. Note that in system architectures E and F, all CPUs can access their dedicated RUs simultaneously due to the crossbar.

Fig. 6. Architecture C.

4) System Architecture D - 4 Processors, 1 Shared RU with 1 Context: As shown in Fig. 7, system architecture D consists of a single RU interconnected with four MPC750 processors using a crossbar. This architecture is similar to system architectures B and C except that the RU can hold a single configuration context, i.e., the active context of the device. In a multiprocessor system, the number of processor sockets is limited. Consequently, sharing of co-processors might become necessary. System architectures B, C and D model such systems. Comparing the performance of applications on architectures B, C and D allows us to evaluate any performance gains obtained by adding more configuration contexts per RU device.

Fig. 7. Architecture D.

Fig. 9. Architecture F.

D. Atalanta RTOS

In our experimentation, we have ported and tested the Atalanta RTOS [7] on our full-system simulator. The Atalanta RTOS is a multi-tasking, event-driven, priority-based RTOS, which is small, compact and deterministic. The Atalanta RTOS can currently be run on ARM and PowerPC shared memory multiprocessor systems.

E. Full-System Simulator Environment

GxEmul is a full-system simulator. It provides instruction-level simulation models for MPC750 microprocessors. Additionally, it provides multiple device models for simulating the entire host platform. All the device models are written in C and are available from [24]. In our research, we utilized the existing GxEmul models for a multiprocessor MPC750 system, a console device, a crossbar switch and an RTC. Furthermore, GxEmul allows integration of custom device models into the full-system simulator. The RU device models are integrated as devices into GxEmul. We have written the RU device models entirely in C. The crossbar device interconnects the CPUs, RUs and the memory controller. This crossbar is responsible for allowing concurrent access to memory by one CPU while another CPU is polling the RU.

As opposed to traditional approaches of extending the instruction set of processors, our approach treats the RU co-processors as custom devices configured and controlled via simple driver routines. These driver routines are described below:

1) ReqRU(SYS_CPU cpu_id): A process running on a CPU can request the control of the computation engine using this command. This function uses a spin-lock to gain access to the cycle register. In the shared mode of the RU, the CPUs use this function to gain exclusive access to the cycle register. That is, this function returns immediately if the CPU is granted access to the RU. Otherwise, it spins until the RU is allocated to the CPU. When the RU is granted to a specific CPU, while the computation is being performed by the computation engine, other CPUs can still download their contexts to the RU. In the dedicated mode, the CPU uses the ReqRU function to wait for the computation engine to finish its computation so that it can write to the cycle register.

2) FreeRU(SYS_CPU cpu_id): This function is used by the CPU that has been granted the permission to use the computation engine to release the control of the RU.

3) SetCycleRegister(unsigned int nCycles): When the CPU has been granted permission to write to the RU cycle register using the ReqRU function, this function is used to set the number of cycles in the cycle register of the RLC. The cycle register can only be set if the computation engine is not performing any computation. In case the computation engine is performing computation, the ReqRU function spins until the computation is finished.

4) WriteToRLCCommandRegr(struct RLCCMDReg cmd): This function allows the programmer to write to the command register of the RLC. This function is used by the applications to write configuration data to the RU and to perform context switching.
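The polling-based synchronization these routines rely on can be illustrated as follows. This is only a sketch of the general shape of the two ReqRU behaviors, not the actual driver source; the helpers ru_try_lock() and ru_read_cycle_reg() are hypothetical stand-ins for the memory-mapped accesses performed through the crossbar (Section III-A).

typedef int SYS_CPU;  /* placeholder; the real CPU id type is simulator-specific */

/* Hypothetical memory-mapped access helpers (not the actual driver). */
extern int      ru_try_lock(SYS_CPU cpu_id);  /* atomic claim of the cycle register */
extern unsigned ru_read_cycle_reg(void);      /* RLC cycle register                 */

/* Shared mode: spin until this CPU gains exclusive access to the cycle
 * register. While one CPU owns the RU and the engine is computing,
 * other CPUs may still download their contexts to the context memory. */
void ReqRU_shared(SYS_CPU cpu_id)
{
    while (!ru_try_lock(cpu_id))
        ;  /* busy-wait; an interrupt-based handshake is left to future work */
}

/* Dedicated mode: the CPU instead waits for the engine to finish
 * (cycle register = 0) so that the cycle register can be rewritten. */
void ReqRU_dedicated(SYS_CPU cpu_id)
{
    (void)cpu_id;
    while (ru_read_cycle_reg() != 0)
        ;  /* poll until the current computation completes */
}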

IV. EXPERIMENTATION WITH FIR FILTERING

In this section, we investigate the trade-offs between employing shared multicontext co-processors with more contexts versus employing dedicated multicontext co-processors with a fewer number of contexts. Each CPU runs at 400MHz while the RU runs at 20MHz. We generated and evaluated instruction traces from the execution of an application on each of these architectures. Finally, we compared the performance gains obtained by the application on each of the architectures with the performance of the application on a software-only multiprocessor system.

For our experiment, we have considered FIR filtering in a multiband processing system used commonly in HiFi systems. This system uses four filters to process a 4-tone input signal into four distinct frequency signals. In a software-only approach, each CPU runs a different filter to process the given input signal. The input signal consists of 512 samples processed by all four filters. In the co-processor enabled configurations, these four FIR filters are each partitioned into two filter stages that are each mapped to a single configuration context. This is because each of the FIR filters is too large to fit in a single configuration context. The multicontext coarse-grained FPGA then performs filtering by switching to the context that represents the first filter stage, filtering the sampled input that has been loaded into the FIFO and writing the processed samples back to the FIFO. The device then switches to the other context that contains the second stage of the FIR filter, filters the samples processed by the first stage and writes the final output back to the FIFO. To tackle the reordering problem of samples, we set the FIFO head pointer of the second context to sample number eight. In the case of a single-context FPGA, each stage has to be individually downloaded. After the FPGA is configured with the first filter stage, the FIR filter is run and the results are stored back in the FIFO. The second stage is then downloaded to the FPGA; this stage processes the samples processed by the first filter stage and writes the final processed values back to the FIFO.

A. FIR Filter Partitioning Across Contexts

The response Y(z) of an FIR filter is given by the transfer function H(z) operating on the input signal X(z). This output is computed by the equation Y(z) = H(z) * X(z). The transfer function H(z) can be given as H(z) = Σ_{i=0}^{N-1} h_i z^{-i}, where h_i is the i-th filter coefficient. H(z) is a polynomial in z^{-1} and can be factorized into smaller polynomials of lower orders. These factorized polynomials represent filter stages in a cascaded multi-stage filter. In our experiment, we decided to implement each of the four 15-order FIR filters as a cascade of two 8-tap filter stages (each of order 7). Each of the two filter stages is mapped to a single configuration context. The mapping of a single stage to the reconfigurable device is shown in Fig. 10. The coefficients of the FIR filter are part of the configuration data. Additionally, we verified the outputs produced by the RU against filters created in MATLAB. This was performed to verify the functioning of the filters across the context changes in the reconfigurable co-processor.
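Written out, the factorization used in our experiment has the form below, where a_i and b_i are generic placeholders for the actual stage coefficients obtained by factorizing H(z):

H(z) = H_1(z) · H_2(z),  where  H_1(z) = Σ_{i=0}^{7} a_i z^{-i}  and  H_2(z) = Σ_{i=0}^{7} b_i z^{-i}.

The impulse response of the cascade is the convolution of the two stages' coefficient sequences; the RU realizes this by running the two contexts back-to-back over the same FIFO.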

Fig. 10. FIR filter configuration on the RU device.
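As a point of comparison for the hardware mapping in Fig. 10, the two-stage cascade can be modeled in software with a few lines of C, analogous to the MATLAB filters we used for verification. The coefficients would be placeholders supplied by the caller, and the fixed-point details of the RU's 16-bit FIFO samples are omitted.

#define NTAPS    8    /* each stage is an 8-tap (order-7) FIR filter */
#define NSAMPLES 512  /* input block size used in our experiment     */

/* One FIR stage: y[k] = sum over i of h[i] * x[k-i], with samples
 * before the start of the block taken as zero. */
static void fir_stage(const float h[NTAPS], const float *x, float *y, int n)
{
    for (int k = 0; k < n; k++) {
        float acc = 0.0f;
        for (int i = 0; i < NTAPS && i <= k; i++)
            acc += h[i] * x[k - i];
        y[k] = acc;
    }
}

/* Two cascaded stages, mirroring the two configuration contexts:
 * the second stage filters the samples produced by the first. */
static void fir_cascade(const float h1[NTAPS], const float h2[NTAPS],
                        const float *in, float *out)
{
    float mid[NSAMPLES];
    fir_stage(h1, in, mid, NSAMPLES);   /* context holding stage 1 */
    fir_stage(h2, mid, out, NSAMPLES);  /* context holding stage 2 */
}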

B. Pseudo-code Executed by an Application

a. Pseudo-code executed on Architecture B by each CPU: In system architecture B, each CPU executes the following steps after the input samples are stored in the FIFO by an external I/O device (which we did not model). Note that in this case the RU can hold the entire configuration data for this application. A C-style realization of this sequence is sketched after the list.
1) Download the first configuration context, corresponding to the first FIR filter stage, to the RU context memory using the function WriteToRLCCommandRegr().
2) Wait till the CPU acquires the RU using the ReqRU() command.
3) Switch to the context which contains the first configuration context of the FIR filter using WriteToRLCCommandRegr().
4) Set the cycle register using the SetCycleRegister() command and start computation.
5) Download the second configuration context, corresponding to the second FIR filter stage, to the RU context memory concurrently while the RU is executing computation in the first context.
6) Poll the cycle register till the cycle register = 0.
7) Switch to the context which contains the second configuration context of the FIR filter.
8) Set the cycle register and start computation.
9) Poll the cycle register till the cycle register = 0.
10) Free the RU, so that the other CPUs can acquire the RU cycle register.
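Expressed with the driver routines from Section III-E, the sequence above might look as follows. Only ReqRU(), FreeRU() and SetCycleRegister() are the documented driver calls; download_context(), switch_context(), read_cycle_register() and the cycle budget are hypothetical wrappers and values, where the wrappers would be built on WriteToRLCCommandRegr() using the commands in Table I.

typedef int SYS_CPU;  /* placeholder for the simulator's CPU id type */

/* Documented driver routines (Section III-E). */
extern void ReqRU(SYS_CPU cpu_id);
extern void FreeRU(SYS_CPU cpu_id);
extern void SetCycleRegister(unsigned int nCycles);

/* Hypothetical wrappers around WriteToRLCCommandRegr() (see Table I). */
extern void download_context(int ctx, const unsigned char *cfg, int len);
extern void switch_context(int ctx);          /* command 0x3             */
extern unsigned read_cycle_register(void);    /* poll RLC cycle register */

#define STAGE_CYCLES 1024  /* illustrative cycle budget per filter stage */

void run_filters_arch_b(SYS_CPU cpu_id,
                        const unsigned char *stage1_cfg,
                        const unsigned char *stage2_cfg, int cfg_len)
{
    download_context(0, stage1_cfg, cfg_len);  /* step 1                       */
    ReqRU(cpu_id);                             /* step 2                       */
    switch_context(0);                         /* step 3                       */
    SetCycleRegister(STAGE_CYCLES);            /* step 4: start stage 1        */
    download_context(1, stage2_cfg, cfg_len);  /* step 5: overlaps computation */
    while (read_cycle_register() != 0)         /* step 6                       */
        ;
    switch_context(1);                         /* step 7                       */
    SetCycleRegister(STAGE_CYCLES);            /* step 8: start stage 2        */
    while (read_cycle_register() != 0)         /* step 9                       */
        ;
    FreeRU(cpu_id);                            /* step 10                      */
}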

b. Pseudo-code executed on Architectures C and E by each CPU: In system architectures C and E, each CPU executes the following steps after the input samples are stored in the FIFO:
1) Wait till the CPU acquires the RU using the ReqRU() command.
2) Download the first configuration context, corresponding to the first FIR filter stage, to the RU context memory using the function WriteToRLCCommandRegr().
3) Switch to the context which contains the first configuration context of the FIR filter.
4) Set the cycle register using the SetCycleRegister() command and start computation.
5) Download the second configuration context, corresponding to the second FIR filter stage, to the RU context memory concurrently while the RU is executing computation in the first context.
6) Poll the cycle register till the cycle register = 0.
7) Switch to the context which contains the second configuration context of the FIR filter.
8) Set the cycle register and start computation.
9) Poll the cycle register till the cycle register = 0.
10) Free the RU.

c. Pseudo-code executed on Architectures D and F by each CPU: In system architectures D and F, each CPU executes the following steps after the input samples are stored in the FIFO:
1) For architecture D, wait till the CPU acquires the RU using the ReqRU() command.
2) Download the first configuration context, corresponding to the first stage of the FIR filter, to the RU context memory.
3) Set the cycle register and start computation.
4) Poll the cycle register till the cycle register = 0.
5) For architecture D, free the RU so that the other CPUs can acquire the RU (in an arbitrary order), and wait till the CPU re-acquires the RU.
6) Download the second configuration context, corresponding to the second stage of the FIR filter, to the RU context memory.
7) Set the cycle register and start computation.
8) Poll the cycle register till the cycle register = 0.
9) For architecture D, free the RU so the other CPUs can acquire the RU cycle register.

C. Experimental Results and Performance Analysis

In our experiment, each architecture runs four filters, each processing the same 512 data samples of the input signal. In each of these architectures, the Atalanta RTOS is booted onto the processors. We include the number of cycles required to boot the operating system as well as the cycles spent in the OS kernel in our analysis of the workload. This represents the true nature of the workload in a real multiprocessor system. Table III shows the simulation results for the six investigated architectures. Our simulator reports execution time in terms of the number of cycles executed.

TABLE III
EXECUTION CYCLE RESULTS (THE LARGER, THE SLOWER)
Number of cycles executed, given as Total / Kernel / Appl. per CPU.

Architecture A (without RU):
  CPU1: 174,123 / 27,100 / 147,023    CPU2: 174,535 / 32,409 / 142,126
  CPU3: 174,173 / 28,647 / 145,526    CPU4: 174,264 / 29,307 / 144,957
  Σ Exec. Cycles: 697,095             Finish Time: 174,535 cycles

Architecture B (1 RU x 8 Contexts):
  CPU1: 12,288 / 6,685 / 5,603        CPU2: 47,581 / 10,441 / 37,140
  CPU3: 19,038 / 8,774 / 10,264       CPU4: 28,861 / 8,621 / 20,240
  Σ Exec. Cycles: 107,768             Finish Time: 47,581 cycles

Architecture C (1 RU x 2 Contexts):
  CPU1: 12,186 / 6,685 / 5,501        CPU2: 50,101 / 10,441 / 39,660
  CPU3: 19,794 / 8,774 / 11,020       CPU4: 30,498 / 8,621 / 21,877
  Σ Exec. Cycles: 112,579             Finish Time: 50,101 cycles

Architecture D (1 RU x 1 Context):
  CPU1: 12,832 / 6,685 / 6,147        CPU2: 51,738 / 9,928 / 41,810
  CPU3: 21,020 / 8,774 / 12,246       CPU4: 32,147 / 9,135 / 23,012
  Σ Exec. Cycles: 117,737             Finish Time: 51,738 cycles

Architecture E (4 RUs x 2 Contexts):
  CPU1: 34,959 / 9,697 / 25,262       CPU2: 31,137 / 11,414 / 19,723
  CPU3: 35,476 / 9,900 / 25,576       CPU4: 37,198 / 10,241 / 26,957
  Σ Exec. Cycles: 138,770             Finish Time: 37,198 cycles

Architecture F (4 RUs x 1 Context):
  CPU1: 34,985 / 9,697 / 25,288       CPU2: 31,159 / 11,414 / 19,745
  CPU3: 35,502 / 9,900 / 25,602       CPU4: 37,226 / 10,241 / 26,985
  Σ Exec. Cycles: 138,872             Finish Time: 37,226 cycles

TABLE IV
PERFORMANCE SUMMARY (THE LARGER, THE FASTER; SPEEDUPS CALCULATED FROM THE FORMULA GIVEN IN [26])

Architecture B (1 RU x 8 Contexts):
  CPU1: (174,123-12,288)/12,288 = 13.17x    CPU2: (174,535-47,581)/47,581 = 2.67x
  CPU3: (174,173-19,038)/19,038 = 8.14x     CPU4: (174,264-28,861)/28,861 = 5.04x
  Finish time: (174,535-47,581)/47,581 = 2.69x

Architecture C (1 RU x 2 Contexts):
  CPU1: (174,123-12,186)/12,186 = 13.28x    CPU2: (174,535-50,101)/50,101 = 2.48x
  CPU3: (174,173-19,794)/19,794 = 7.79x     CPU4: (174,264-30,498)/30,498 = 4.71x
  Finish time: (174,535-50,101)/50,101 = 2.48x

Architecture D (1 RU x 1 Context):
  CPU1: (174,123-12,832)/12,832 = 12.56x    CPU2: (174,535-51,738)/51,738 = 2.37x
  CPU3: (174,173-21,020)/21,020 = 7.28x     CPU4: (174,264-32,147)/32,147 = 4.42x
  Finish time: (174,535-51,738)/51,738 = 2.37x

Architecture E (4 RUs x 2 Contexts):
  CPU1: (174,123-34,959)/34,959 = 3.98x     CPU2: (174,535-31,137)/31,137 = 4.61x
  CPU3: (174,173-35,476)/35,476 = 3.91x     CPU4: (174,264-37,198)/37,198 = 3.69x
  Finish time: (174,535-37,198)/37,198 = 3.69x

Architecture F (4 RUs x 1 Context):
  CPU1: (174,123-34,985)/34,985 = 3.98x     CPU2: (174,535-31,159)/31,159 = 4.60x
  CPU3: (174,173-35,502)/35,502 = 3.91x     CPU4: (174,264-37,226)/37,226 = 3.68x
  Finish time: (174,535-37,226)/37,226 = 3.68x

Table IV summarizes all speedup results, which were calculated against the software-only approach of system architecture A, i.e., filter implementations running on individual CPUs without co-processors. From our experimentation, we observed that the download time of the configuration data representing a single stage of the FIR filter was approximately 860 cycles. In architectures B and C, the download of configuration data was performed concurrently with the computation. The only difference is that in architecture C each CPU waits till the RU is exclusively granted to it. Architecture D is similar to architectures B and C except that there was no overlap whatsoever between download and execution in the RU. In architecture E, the download of configuration data was performed concurrently with computation. However, there were bus contentions as all CPUs were simultaneously performing fetching and download of configuration data. Architecture F is similar to architecture E except that for each CPU there was no overlap whatsoever between download and execution in the RU. As seen from Table III, in architectures B, C and D, the shared RU seemed to be allocated to CPU1, CPU3, CPU4 and CPU2, in that order. We obtained the expected results shown in the Finish Time column of Table III: from fastest to slowest, architecture E > architecture F > architecture B > architecture C > architecture D > architecture A. Although we expected that application performance on architectures E and F would be much better than application performance on architectures B, C and D, architectures E and F were only slightly better due to the memory bottleneck faced by each CPU while attempting to fetch the configuration data from shared memory and download it to the RU.
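For reference, each entry in Table IV is computed as the relative cycle reduction over the software-only baseline of architecture A:

speedup = (C_A - C_arch) / C_arch,

where C_A is the cycle count on architecture A and C_arch is the corresponding count on the architecture under test (per-CPU totals for the CPU entries, finish times for the finish-time entries). For example, CPU1 on architecture B yields (174,123 - 12,288) / 12,288 = 13.17x.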

Although the finish time of the application in architectures E and F was smaller than the finish time in architectures B, C and D, certain CPUs individually executed fewer cycles in architectures B, C and D when compared with architectures E and F. This is mainly because in architectures C and D, downloading the configuration data to the RU for each stage was completely serialized as there was only one RU, resulting in no bus contention. This was also the case in architecture B, except that concurrent execution of step 1 by all CPUs (as shown in the pseudo-code) resulted in bus contention during the download of the first configuration context. This bus contention is visible in the form of a larger number of cycles executed by CPU1 in architecture B as compared to architecture C. As opposed to architectures B, C and D, there were bus contentions in architectures E and F due to the presence of a single shared memory even though these architectures had dedicated RUs. Finally, the Σ Exec. Cycles figures in Table III suggest an interesting tradeoff between aggregated execution time and resource and power consumption.

From the results, we make the following conclusions:
1) Multicontext devices show speedups over single-context devices in both shared and dedicated architectures, as computation was performed concurrently with the loading of configuration contexts. This can be seen from the lower number of cycles executed by each CPU in architectures B, C and E as compared to architectures D and F.
2) Dedicated co-processors perform better than shared multicontext co-processors. This can be seen from the finish times in architectures E and F as compared to architectures B, C and D.

Based on the above observations and conclusions, we make the following suggestions:
1) Dedicated multicontext co-processors perform better than shared multicontext co-processors. This makes a strong case for integration of dedicated co-processors within the processor chip itself. Furthermore, we hypothesize that the dedicated co-processor architectures would have performed much better if dedicated memories or caches were present on the processor chips to store configuration contexts. This would have minimized bus contention to fetch configuration contexts from the shared memory.
2) In this experiment, we have assumed statically partitioned applications. We hypothesize that dynamically partitioned applications might perform better with shared RUs. If compilers produce both hardware and software versions of compute-intensive functions, then a CPU can execute the software version in case the RU co-processor is busy. To ensure that the decision to run the hardware or the software version of a function is made correctly, the operating system should be aware of the reconfigurable logic co-processors.

V. CONCLUSION

In this paper, we presented a comparative evaluation of six hybrid multiprocessor systems with multiple reconfigurable co-processors using our extended full-system simulator. We have discussed the various hardware architectures investigated and the software APIs provided to control the co-processor units. To evaluate the systems, we have mapped FIR filters to the various system configurations investigated. Our results indicate that multicontext devices have a potential for speedups as they hide the latency of downloading the configuration data. The total execution time of a system can be reduced if there are dedicated co-processor units for each CPU. In the future, the number of processor cores in a system will increase exponentially. However, the number of reconfigurable co-processors will still be limited. Sharing of co-processors may degrade the system performance. The high performance of dedicated co-processors justifies the need for reconfigurable co-processors to be integrated within the processor chip. Further speedups may also be achieved if the advances suggested for compilers and operating systems are made. Although we have made some assumptions in this research, our results convey valuable information about using reconfigurable co-processors in a multiprocessor system. Our future work includes:



• To extend our simulator to include dedicated memories/caches for configuration data. This is postponed to our future work as it will require changes to GxEmul.
• To investigate and evaluate other applications using our simulator for a comprehensive evaluation of the architectures explored in this paper.

VI. ACKNOWLEDGEMENT

We would like to thank Anders Gavare for his help in using GxEmul.

REFERENCES
[1] Celoxica RCHTX System, http://www.celoxica.com/products/rchtx/default.asp, visited Jan. 2006.
[2] DRC Computer Systems, http://www.drccomputer.com/pages/modules.html, visited Jan. 2006.
[3] XtremeData Inc., http://www.xtremedatainc.com/, visited Jan. 2006.
[4] M. Gokhale and P. Graham, Reconfigurable Computing: Accelerating Computation with Field Programmable Gate Arrays, Springer, The Netherlands, 2005.
[5] E. Mirsky and A. DeHon, "MATRIX: A reconfigurable computing architecture with configurable instruction distribution and deployable resources," IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
[6] T. Kitaoka, H. Amano and K. Anjo, "Reducing the configuration loading time of a coarse grain multicontext reconfigurable device," Field-Programmable Logic and Applications, pp. 171-180, 2003.
[7] D. Sun, D. Blough and V. Mooney, "Atalanta: A new multiprocessor RTOS kernel for system-on-a-chip applications," Tech. Rep. GIT-CC-02-19, College of Computing, Georgia Tech, Atlanta, GA, 2002.
[8] J. Hadley, "The performance enhancement of a run-time reconfigurable FPGA system through partial reconfiguration," Master's thesis, Brigham Young University, Provo, UT, Nov. 1995.
[9] Z. Li and S. Hauck, "Configuration compression for Virtex FPGAs," IEEE Symposium on Field-Programmable Custom Computing Machines, 2001.
[10] S. Hauck, "Configuration prefetch for single context reconfigurable coprocessors," ACM/SIGDA International Symposium on FPGAs, pp. 65-74, 1998.
[11] S. Trimberger, D. Carberry, A. Johnson and J. Wong, "A time-multiplexed FPGA," Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pp. 22-28, 1997.
[12] A. DeHon, "DPGA-coupled microprocessors: Commodity ICs for the early 21st century," 2nd IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), pp. 31-39, 1994.
[13] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor and R. Laufer, "PipeRench: A coprocessor for streaming multimedia acceleration," Proceedings of the 26th Annual International Symposium on Computer Architecture, pp. 28-39, May 1999.
[14] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," 5th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 12-21, 1997.
[15] Xilinx Corporation, http://www.xilinx.com/, visited Jan. 2006.
[16] J. E. Carrillo Esparza and P. Chow, "The effect of reconfigurable units in superscalar processors," 9th ACM International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 141-150, 2001.
[17] M. Platzner, "Co-simulation of a hybrid multi-context architecture," Engineering of Reconfigurable Systems and Algorithms (ERSA), June 2003.
[18] T. Austin, E. Larson and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," IEEE Computer, vol. 35, no. 2, pp. 59-67, 2002.
[19] P. Magnusson et al., "Simics: A full system simulation platform," IEEE Computer, vol. 35, no. 2, pp. 50-58, 2002.
[20] Mentor Graphics, ModelSim, http://www.mentor.com/modelsim.
[21] Mentor Graphics, Hardware/Software Co-Verification: Seamless, http://www.mentor.com/seamless/.
[22] W. Fu and K. Compton, "A simulation platform for reconfigurable computing research," IEEE International Conference on Field Programmable Logic and Applications, Aug. 2006.
[23] AMD Multicore Website, http://multicore.amd.com/us-en/AMDMulti-Core.aspx.
[24] GxEmul, http://gavare.se/gxemul/.
[25] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin and B. Hutchings, "A reconfigurable arithmetic array for multimedia applications," ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, 1999.
[26] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1996.