SINGLE CHIP DSP ARRAY PROCESSOR: 100 MILLION + TRANSISTORS WITH MULTITHREADING APPROACH

Radovan Sernec (1), Matej Zajc (2), Jurij F. Tasič (2)

(1) BIA Ltd., Teslova 30, 1000 Ljubljana, Slovenia
[email protected]

(2) Fakulteta za elektrotehniko, Tržaška, Ljubljana, Slovenia
[email protected]

ABSTRACT

We propose an efficient programmable parallel architecture for DSP and matrix algebra applications that can exploit parallelism at the algorithm (topology) level via systolic/SIMD array processing and at the instruction level via a multiple-issue control processor capable of multithreading. Our premise is »One array – one chip«: the integration of systolic/SIMD processing on the same processor array, together with the required data storage. Multithreading on systolic/SIMD arrays is analysed through examples, which show that substantial speedups (100 %-800 %) are possible when up to four threads are interleaved on a cycle-by-cycle basis. We target processor element (PE) granularities in the word range and include support for floating-point operations. Furthermore, the array integrates data memory at two levels of hierarchy: local per PE (SIMD) and global for the whole processing array (systolic). The complexity of such a system is explored in detail, and it is shown that a 32 PE array can be implemented on a 120 million transistor chip.

1. INTRODUCTION

Systolic arrays are efficient parallel architectures used in digital signal processing, for solving linear algebra algorithms and for other problems. They exploit regularity in data flows and in the processor interconnection network topology, local processor cell communication, synchronous operation and a single instruction stream applied to many data elements (instruction systolic arrays are excluded from this definition) [1]. To synchronise data flows properly within a systolic array, unit delays driven by the global clock are inserted on the PE communication data paths. The net throughput of the array can be low compared to the processor cell cycle time and is inherently limited by data availability constraints.

Multithreading is a technique usually used to mask long memory or interprocessor communication latencies and also to prevent data hazards within pipelines [2]. These problems have recently become more severe, since memory cycle times are not shrinking at the same pace as processor cycle times. The idea is straightforward: when any kind of latency or interrupt occurs, switch to an independent thread of execution, and switch back to the previous instruction stream when the relevant data become available. In this paper we apply multithreading principles to systolic/SIMD arrays in order to increase their net throughput, and we explore the architectural details of such a processor ensemble.

The target applications and algorithms for the proposed architecture come from two domains: DSP and matrix algebra. The first includes systolised convolution, FIR, IIR, DFT, DCT and similar algorithms in 1D and 2D variants; the second covers vector and matrix algebraic operations, linear equation solving and a modified Faddeev algorithm (FA) [3], which by itself offers a multitude of combinations of matrix operations. Additionally, every algorithm that can be systolised onto a 1D or 2D processing array can be executed on this PE array. The paper is organised as follows: first, a case study of systolic 1D convolution and its transformation to the multithreaded case is presented, followed by an overview of the processor architecture and a possible single-chip implementation combining the processor array with data memories.

2. MULTITHREADING IN SYSTOLIC/SIMD ARRAYS

Let us analyse why systolic algorithms cannot be fully pipelined at the processor cycle level and thus achieve a throughput of one result per processor clock cycle. By systolic cycle we mean a global clock cycle used to move data elements through the systolic array; all interprocessor communication registers are synchronised to this clock. Throughput can be limited for four different reasons:

1. Lower than 100 % efficiency of a given systolic algorithm, when input data are interspersed with dummy values in order to satisfy proper data synchronisation within the array. If a dummy value follows each input data element, efficiency drops to 50 %; the 1D convolution algorithm studied below is such a case.
2. Read-after-write (RAW) hazards within the processing cell.
3. Long latency of operations: the notation of systolic algorithms assumes that operations inside each processing cell take zero time, and unit delays are inserted to achieve proper synchronisation. The input of data values is bounded by the systolic cycle, which equals multiple processor clock cycles.
4. Slow memory cycle times compared to processor cycle times: many DSP systolic algorithms are very simple and thus exhibit short operation latencies, but a new data value cannot be fetched due to memory cycle constraints, or is not yet available from the cell above because it has not been calculated yet.

The common factor limiting throughput is thus the availability of data, and the obvious solution is the introduction of interleaved independent data sets into the systolic array. Let us take a look at a simple case of systolic 1D convolution, whose processing cell is shown in Figure 1.

Figure 1: Systolic cell for 1D convolution used in the examples below. The cell comprises a multiplier (×) and an adder (Σ); the convolution constant C is held in register r3, registers r1-r5 sit on the data paths, and channels CH1-CH5 connect the cell to its neighbours. Note the register names included on the data paths.
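To make the cell's behaviour concrete, the following behavioural sketch (ours, not taken from the paper) models a linear array of such cells under one common systolic FIR arrangement: x values travel through the array while partial sums accumulate alongside. The function name and the pipeline-flush logic are our own choices.

    # Behavioural sketch of the 1D convolution systolic cell of Figure 1:
    # each cell multiplies the passing x value by its constant c[i] and
    # adds the partial sum arriving from its left neighbour; the unit
    # delays are modelled by moving values one cell per systolic cycle.

    def systolic_convolution(x, c):
        """Compute y[n] = sum_k c[k] * x[n-k] on a linear array of cells."""
        n_cells = len(c)
        x_regs = [0.0] * n_cells   # x values travelling through the array
        y_regs = [0.0] * n_cells   # partial sums travelling alongside
        out = []
        for x_in in x + [0.0] * n_cells:        # extra zeros flush the pipe
            # One systolic cycle: all cells fire simultaneously, so update
            # from the last cell backwards to preserve register semantics.
            for i in reversed(range(n_cells)):
                y_prev = y_regs[i - 1] if i > 0 else 0.0
                x_prev = x_regs[i - 1] if i > 0 else x_in
                y_regs[i] = y_prev + c[i] * x_prev   # fmul + fadd
                x_regs[i] = x_prev                   # x moves one cell right
            out.append(y_regs[-1])
        return out[n_cells - 1:]   # skip the fill-up latency

    print(systolic_convolution([1.0, 2.0, 3.0, 4.0], [0.5, 0.25]))
    # -> [0.5, 1.25, 2.0, 2.75, 1.0]

Note that this sketch models the 100 % efficient case; the dummy-value spacing discussed below halves the real throughput.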

Figure 2: Gantt chart for 1D convolution on a linear systolic array, showing the per-cell instruction sequence rc2f r1,r2; fmul r4,r1,r3; fadd r5,r2,r4; wc2f r1,r5 against processor cycles 1-12. The length of the systolic cycle is 8 clock cycles; the second iteration, i.e. the next (although dummy) data input, can start on clock cycle 9. Note the inherent read-after-write hazard between the multiplication and addition instructions.

Figure 3: Gantt chart of four-threaded 1D convolution on a linear systolic array, with the per-cell sequence rc2f r1,r2; fmul r4,r1,r3; fadd r5,r2,r4; wc2f r1,r5 issued on successive cycles; letters x, y, z, u denote data inputs from different threads; the systolic cycle is four clock cycles long. Note that there are no data dependencies among successive arithmetic operations, since the data come from independent threads.

Execution of the algorithms is presented with Gantt charts; cycles are processor cycles within each processing element. Note that this particular algorithm has 50 % efficiency, meaning that a dummy value follows each data element. Figure 2 shows the assembly-coded algorithm running in each processing cell. All instructions have the form: mnemonic DST, SRC1, SRC2. Instructions rc2f (read two floating-point values from channels) and wc2f (write two floating-point values to channels) take care of input and output, respectively. The convolution constant is stored in r3. We see that the systolic cycle cannot be shorter than 8 processor cycles, assuming the latencies presented. Dummy data can thus be input on cycle 9 (50 % efficiency) and the next real data value on cycle 17. Assume for the moment that our processor works at 100 MHz; the throughput of the whole systolic array is then only 6.25 million data values/s.

We next apply the interleaving of independent data values - multithreading - to this problem. Each thread can have its own register file, or alternatively all threads can share a larger common one with appended tag bits differentiating the threads within the single register file. We assume that four independent data threads are available, which can be freely interleaved to achieve the minimal initiation interval. Observe Figure 3, where instructions are issued every cycle, unlike the single-threaded case where instruction issue was dictated by RAW hazards. Note that the processor still has the same architecture and can issue only one instruction per cycle. Instruction groups of four, which belong to the same systolic cycle, take as operands data belonging to different independent threads. The fadd operation can thus start immediately after fmul, since their operands come from independent data sets. Throughput is greatly enhanced: four data values are processed within 16 processor cycles, whereas in the classic case the same number of cycles yields only one data value. The lower inherent algorithm efficiency is eliminated completely here, since dummy values are treated as belonging to a different data set. For the same implementation assumptions we get a net throughput of 16 cycles per 4 data values at 100 MHz, which is 25 million data values/s; this is 300 % higher than in the previous case.

There is another point worth mentioning: pipelining of functional units. This feature is not desirable in classic systolic arrays, since it lengthens latency and thus the systolic cycle (Figure 2). Multithreading, in contrast, favours pipelined functional units, which shorten the processor clock cycle and increase the net throughput.
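To make the cycle accounting concrete, here is a toy model (ours, not the authors' simulator) of the two schedules. The per-operation latencies are assumptions chosen to reproduce the 8-cycle systolic cycle of Figure 2; the multithreaded count assumes the steady state of Figure 3, where one instruction retires per cycle and the pipeline tail is ignored.

    # Assumed latencies (PE clock cycles) that make the dependent chain
    # rc2f -> fmul -> fadd -> wc2f span the 8-cycle systolic cycle of Fig. 2.
    LATENCY = {"rc2f": 1, "fmul": 3, "fadd": 3, "wc2f": 1}
    CHAIN = ["rc2f", "fmul", "fadd", "wc2f"]

    def classic_cycles(values):
        # RAW hazards serialise the chain; 50 % efficiency means a dummy
        # input occupies every second systolic cycle.
        systolic_cycle = sum(LATENCY[op] for op in CHAIN)  # = 8
        return values * 2 * systolic_cycle

    def interleaved_cycles(values):
        # Four threads issued round-robin: consecutive instructions come
        # from independent threads, so the issue slot is full every cycle.
        return values * len(CHAIN)

    MHZ = 100
    for name, cycles in [("classic systolic", classic_cycles(4)),
                         ("4-thread interleaved", interleaved_cycles(4))]:
        print(f"{name:20s}: {cycles:2d} cycles for 4 values -> "
              f"{4 / cycles * MHZ:.2f} M values/s")
    # -> 64 cycles (6.25 M values/s) vs. 16 cycles (25.00 M values/s)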

Contemporary RISC processors are able to issue and execute several instructions simultaneously. Can multithreaded systolic arrays benefit from multi-instruction issue? In order to study these effects, our implementation must allow the issue of more than one instruction every processor cycle; we set the limit at two instructions issued every cycle.

Figure 4: Gantt chart for an alternative four-threaded 1D convolution on a linear systolic array; letters x, y, z, u denote data inputs from different threads; the average systolic cycle is 1.8 clock cycles (initiations 1, 1, 1, 1, 5); note also the multi-issue from cycle two onwards. The number of functional units stays the same.

This is the case of simultaneous multithreading [4]. Let us take a look at this different arrangement of multithreaded execution of the same algorithm in Figure 4. Instructions bound for different functional units are issued simultaneously on cycles 2, 3, 4, 5 and 8. We also assume that each functional unit has its own port to the common register file. On cycle 7, for example, the adder and the multiplier both finish execution; if only one result bus is available, the store of one of the two results must be stalled for one cycle. Alternatively, each thread can be provided with its own register file, as already mentioned. This kind of multithreading, coupled with multiple instruction issue from different threads, is also called simultaneous multithreading. On each cycle, two instructions from different threads are issued simultaneously, although there are as many as four instructions in different execution stages (cycles 4, 5, 6, 7, 8). What is the net performance gain? Four data values are processed every 11 cycles, which gives a net throughput of 36.36 million data values/s, a rise of 481 % compared to the classic system and of 45 % compared to the single-issue multithreaded system. Note that multi-instruction issue speeds up nothing in the classic systolic case, due to the inherent RAW hazards among instructions (functional units).

Two factors not considered so far are instruction scheduling and data arrangement. In systolic arrays the data flows must be precisely synchronised as required by the algorithm. The instruction issuing mechanism must therefore never schedule an instruction for which no data values are currently available, and it must be deterministic in the sense that it guarantees that instructions from different threads are always scheduled and issued in the same order. This strictly applies to the multi-issue multithreading case. The data flows, in turn, must be 'prepared' in advance in the same manner, i.e. data values from different threads are positioned contiguously in accordance with the instruction issue order.
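The three throughput figures follow directly from the cycle counts quoted above; this short check (ours, assuming the 100 MHz clock of the examples) reproduces them, giving +482 % where the paper rounds to 481 %.

    MHZ = 100  # PE clock assumed throughout the examples
    cases = {                                # (cycles, data values per batch)
        "classic systolic":          (16, 1),
        "4-thread, single-issue":    (16, 4),
        "4-thread, two-issue (SMT)": (11, 4),
    }
    base = 1 / 16 * MHZ                      # classic throughput, M values/s
    for name, (cycles, values) in cases.items():
        mvals = values / cycles * MHZ
        print(f"{name:26s}: {mvals:5.2f} M values/s "
              f"({(mvals / base - 1) * 100:+4.0f} % vs classic)")
    # -> 6.25, 25.00 (+300 %), 36.36 (+482 %) M values/s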

Figure 6: Internal PE structure. Note the replicated register file banks (16 × 32 bits each), one per thread (threads #1-#4), which can run simultaneously on the PE array, the combined integer & floating-point multiply/add fused (MAF) unit, and the common result bus; a 128-bit path connects the register file banks to local memory.

3. PROCESSOR ARRAY ARCHITECTURE: VeMUS2DAP

The VeMUS2DAP array processor consists of two parts: a multi-issue controller and a mesh-connected PE array. The multi-issue controller steps through the program, executes all scalar operations and branches, and issues decoded array instructions (scalar or vector) to the PEs. Employing a multi-issue controller solves two problems. First, most matrix algebra systolic algorithms require the execution of two different programs on the same PE array; the multi-issue controller can issue instructions from different instruction streams and direct control to two different parts of the PE array, as required by the systolic algorithm. Second, the throughput of the PE array can be enhanced by multithreading several independent data streams of the same algorithm on one array. It is worth mentioning that the same multi-issue controller can also run different algorithms (threads) simultaneously, in which case they compete for the resources within the scalar and PE portions of the system. Figure 5 shows the instruction slots for the two cases.
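A hypothetical sketch (function and region names are ours) of the two ways an issue slot can be filled, as described above and shown in Figure 5:

    # Hypothetical model of the controller's two-instruction issue slot.

    def two_program_slot(inst_a, inst_b):
        # Case 1: two instruction streams steered to two PE regions, e.g.
        # the rectangular and triangular parts of an FA-style array.
        return [(inst_a, "PE region 1"), (inst_b, "PE region 2")]

    def interleaved_slot(thread_insts, cycle_no):
        # Case 2: four threads of the same algorithm interleaved round-robin
        # within the slot, both instructions bound for the whole PE array.
        n = len(thread_insts)
        first = (2 * cycle_no) % n
        return [(thread_insts[first], "whole array"),
                (thread_insts[(first + 1) % n], "whole array")]

    print(two_program_slot("fmul r4,r1,r3", "fadd r5,r2,r4"))
    print(interleaved_slot(["rc2f r1,r2"] * 4, cycle_no=0))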

Inter-PE communication is carried out via four bi-directional 32-bit buses, which are tied directly to the four neighbouring PEs and to four additional r/w register file ports. Local memory is connected to the register file through a 128-bit wide bus via the same four r/w ports. Each register file thus has eight ports and 16 32-bit registers. Figure 6 shows the PE; more information on the internals of VeMUS2DAP can be found in [5].
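As an illustration of the replicated, thread-tagged register organisation (a sketch under our own assumptions, not the authors' design), one bank of sixteen 32-bit registers per thread can be selected by the thread tag so that threads never collide on register names:

    # Model of the per-thread register banks of Figure 6: four banks of
    # sixteen 32-bit registers, one bank per hardware thread.

    class ThreadedRegisterFile:
        def __init__(self, threads=4, regs=16):
            self.banks = [[0] * regs for _ in range(threads)]

        def read(self, thread, reg):
            return self.banks[thread][reg]

        def write(self, thread, reg, value):
            self.banks[thread][reg] = value & 0xFFFFFFFF  # 32-bit registers

    rf = ThreadedRegisterFile()
    rf.write(thread=2, reg=3, value=0x3F800000)  # e.g. constant into thread z's r3
    print(hex(rf.read(thread=2, reg=3)))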

Figure 5: Multi-instruction issue slot: issuing two instructions from two different threads for different portions of the PE array (Inst. #1 controlling PE1,1-PE4,4 and Inst. #2 controlling PE1,5-PE4,8), and interleaving four threads within the issue slot for the whole PE array.

Figure 7: Layout of the 32 PE array with two global memory arrays, of 28 Mb and 56 Mb capacity, respectively; each bank holds 7 Mb behind a 128-bit interface. The controller is augmented with 32 KB split caches and a large local memory.

4. SINGLE-CHIP IMPLEMENTATION

Recent studies of semiconductor integration trends have shown that by the year 2003 more than 100 million transistors will be available for single-chip implementations, and by 2010 around a billion [6]. This has initiated various design projects on how to employ such huge resources most effectively. There seems to be a consensus among processor designers that a major part of future Si dies will be devoted to various memories, but the proposed processor granularities vary widely. Single-processor implementations are not interesting to us, but there are similar multiprocessor projects under way, such as IRAM, chip multiprocessors, computational RAM, PPRAM, etc. [7]. These approaches face technology trade-offs, namely the cohesion of a rather slow but highly dense DRAM process with a very fast logic process whose integration density is 3-5 times lower. It can be expected that such processor designs will run at lower clock speeds, but on-chip parallelism will provide performance gains well beyond single-processor implementations. Our project proposes the single-chip integration of a whole systolic/SIMD array processor with large amounts of memory, which either acts as a global data pool (systolic mode) or is locally available to each PE (SIMD mode).

The main obstacle to the 2D integration of an expandable systolic or SIMD array onto a single chip is its high pin count demand. Consider a 2D 8-by-8 PE array with 32-bit inter-PE paths, required to be expandable by interconnecting multiple such chips into larger PE arrays. The pin count in this case reaches 4 sides × 8 PE/side × 32 pins/port = 1024 pins, and this disregards the additional connection requirements to a large memory pool, which is necessary for systolic arrays. At least 200 further pins must be added for the controller connection to program memory and for the various handshake signals for communication with the host. Power connections would add another 30-40 % for a grand total of approximately 1700 pins. This is unachievable at present and in the near future, or would make for a very costly solution.
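The pin budget can be checked in a few lines (ours); the 30-40 % power-pin overhead is the paper's own estimate.

    # Pin count estimate for an expandable 8-by-8 PE array chip.
    sides, pes_per_side, bits_per_port = 4, 8, 32
    signal_pins = sides * pes_per_side * bits_per_port   # 1024 inter-PE pins
    control_pins = 200                       # program memory + host handshake
    subtotal = signal_pins + control_pins    # 1224
    for power_overhead in (0.30, 0.40):
        print(f"with {power_overhead:.0%} power pins: "
              f"{round(subtotal * (1 + power_overhead))} pins")
    # -> 1591 to 1714 pins, i.e. the ~1700 quoted above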

Our goal is therefore to design a single-chip array processor system complete with controller, 2D PE array and sufficient data memory. Figure 7 outlines a possible layout. The PE array is organised as an 8-by-4 mesh. The reason for the non-square array is that several matrix algorithms, such as FA and QR, require a non-square topology to run efficiently; FA requires two seamlessly connected arrays, one rectangular n × n and one triangular n × n. Observe the two global memory (GM) arrays (they reside in the same address space), which is again beneficial for problems larger than the PE array size and facilitates efficient problem partitioning and buffering of intermediate results. Dynamic RAM is used for the LM to reduce its size; its longer cycle times are largely compensated by the 128-bit wide access paths and interleaving, and the performance penalty is not likely to be high due to the vector nature of the data sets. The GM has 84 Mb capacity and is also dynamic, partitioned into two large arrays, subdivided into four and eight 128-bit wide banks, respectively. The GM capacity is large enough to accept 64-bit data in the form of four 512 × 512 input matrices and one 512 × 512 output matrix, as required by FA. For fast refill of the GM from larger off-chip memories there is a 16-bit wide RAMBUS interface cycling at 500 MHz. The whole GM can theoretically be refilled in approximately 0.01 s, at a throughput of 8×10^9 b/s, versus the 8.2×10^11 b/s needed by the PEs, which run at 200 MHz.
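A quick check (ours) of the refill time and the bandwidth gap, assuming the PE demand corresponds to 32 PEs each making a 128-bit access per 200 MHz cycle:

    # Global memory refill versus PE bandwidth demand.
    gm_bits = 84e6                      # 84 Mb global memory
    rambus_bw = 16 * 500e6              # 16-bit RAMBUS at 500 MHz = 8e9 b/s
    pe_demand = 32 * 128 * 200e6        # assumed: 32 PEs x 128 bits x 200 MHz
    print(f"refill time:   {gm_bits / rambus_bw:.4f} s")      # ~0.0105 s
    print(f"PE demand:     {pe_demand:.1e} b/s")              # ~8.2e11 b/s
    print(f"bandwidth gap: {pe_demand / rambus_bw:.0f}x")     # ~102x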

Table 1 summarises the transistor requirements for such a system. The chip contains 124 million transistors; less than 10 million of these are used in logic, the rest in memory arrays.

    ITEM & COMPLEXITY                    # of transistors  # of items  Sum
    Controller core                      5×10^5            1           5×10^5
    Controller I cache                   1.8×10^6          1           1.8×10^6
    Controller D cache                   1.8×10^6          1           1.8×10^6
    Controller local memory I&D (1 MB)   8.5×10^6          1           8.5×10^6
    PE MAF unit with 4 register files    6×10^5            32          19.2×10^6
    PE LM + mux logic                    2.5×10^5          32          8×10^6
    GM + mux logic                       84×10^6           1           84×10^6
    I/O interfaces                       2×10^5            1           2×10^5
    Grand total                                                        124 million transistors

Table 1: Complexity of the proposed system; note that 83 % of the transistors are taken up by memories.

5. CONCLUSION

The proposed integration of a large PE array with sufficient amounts of data memory is one possible answer to the question of how to efficiently employ the millions of transistors that will become available within a couple of years. Furthermore, its design is much simpler than that of contemporary wide-issue superscalar RISC processors, owing to the replicated data paths and the simple controller architecture. Although multi-issue logic is required within the controller, only independent threads of control are issued in one slot, which greatly reduces the dependency resolution logic. DSP algorithms and tasks require large data vectors or matrices, which can be manipulated simply on the FA-amenable PE array. Expansion of the PE array is possible if the bus of issued (but not yet decoded) instructions is brought out to the I/O pins; this instruction bus can then be connected to other chips working in synchronised lock-step fashion, although the operating frequency would be lower in this case [8].

6. REFERENCES

[1] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[2] R. Iannucci (ed.), Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer AP, 1994.
[3] J. G. Nash, S. Hansen, "Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations", IEEE Trans. on Computers, 37 (2), pp. 129-136, 1988.
[4] J. L. Lo, "Compilation issues for simultaneous multithreaded processor", 1st SUIF Compiler Workshop, Jan. 1996, pp. 146-147.
[5] R. Sernec, M. Zajc, J. Tasič, "Design trade-offs of a parallel DSP architecture: Combining instruction and data level parallelism", submitted for publication to WDTA'98, Dubrovnik, Croatia.
[6] The National Technology Roadmap for Semiconductors, SIA, San Jose, USA, 1997.
[7] IEEE Computer, special issue "The Future of Microprocessors", Sept. 1997.
[8] http://www.bops.com/