AN FPGA-BASED SOFT MULTIPROCESSOR SYSTEM FOR IPV4 PACKET FORWARDING

Kaushik Ravindran, Nadathur Satish, Yujia Jin, Kurt Keutzer

University of California at Berkeley, CA, USA
{kaushikr, nrsatish, yujia, keutzer}@eecs.berkeley.edu

ABSTRACT

To realize high performance, embedded applications are deployed on multiprocessor platforms tailored for an application domain. However, when a suitable platform is not available, only a few application niches can justify the increasing cost of an IC product design. An alternative is to design the multiprocessor on an FPGA. This retains the programmability advantage while obviating the risks of producing silicon, and it opens FPGAs to the world of software designers. In this paper, we demonstrate the feasibility of FPGA-based multiprocessors for high performance applications. We deploy IPv4 packet forwarding on a multiprocessor on the Xilinx Virtex-II Pro FPGA. The design achieves a 1.8 Gbps throughput and loses only 2.6X in performance (normalized to area) compared to an implementation on the Intel IXP2800 network processor. We also develop a design space exploration framework that uses Integer Linear Programming to explore multiprocessor configurations for an application. Using this framework, we obtain a more efficient multiprocessor design that surpasses the performance of our hand-tuned solution for packet forwarding.

1. INTRODUCTION

A soft multiprocessor system is a network of programmable processors crafted out of processing elements, logic blocks and memories on an FPGA. Soft multiprocessors allow the user to customize the number of programmable processors, the interconnect scheme, the memory layout and the peripheral support to meet application needs. Deploying an application on the FPGA is then tantamount to writing software for this multiprocessor system. Xilinx provides tools and libraries for soft multiprocessor development on the Virtex family of FPGAs. This environment integrates the on-chip IBM PowerPC 405 cores, soft MicroBlaze cores, and customizable peripherals [1].

In the embedded domain, the continuous increase in performance requirements has fueled the need for high performance design platforms.
However, the need to adapt products to rapid market changes and the introduction of new protocols have made software programmability an important criterion for the success of these devices. Hence, the general trend has been toward multiprocessor platforms specialized for an application domain to address the combined needs of programmability and performance. Application-specific software-programmable platforms are dominant in a variety of markets, including digital signal processing, gaming, graphics and networking. The soft multiprocessor solution proposes to implement these multiprocessors on an FPGA instead of casting the design into silicon.

But why would we even consider FPGAs as a medium for these multiprocessor systems? Soft multiprocessors will surely lose the performance factor that attends implementation in FPGA logic versus custom multiprocessor designs. However, we must weigh performance against product design and manufacturing costs to understand the benefits of soft multiprocessors. Technology scaling toward smaller process geometries is driving IC design costs into the $20 million range; in turn, product revenues need to reach $200 million to repay the investment [2]. If an ASIC or application-specific multiprocessor is not already available for an application niche, the prohibitive design costs and shrinking market windows make IC development an unattractive option. FPGA solutions alleviate the risks due to silicon development costs and design turnaround times. At the same time, the multiprocessor abstraction retains the advantage of software programmability and provides an easy way to deploy applications from an existing code base. FPGAs also allow the designer to customize the multiprocessor for a target application: designers can iteratively explore other configurations or offload critical functions into co-processors on the fabric to improve performance.

In order to justify the viability of soft multiprocessors, we address the following questions: (a) Can soft multiprocessors achieve performance competitive with custom multiprocessor solutions? (b) How do we design efficient systems of soft multiprocessors for a target application?

To demonstrate the effectiveness of soft multiprocessor systems, we empirically evaluate the performance of a soft multiprocessor design for the data plane of the IPv4 packet forwarding application [3]. We construct a 2-port 2 Gbps router as a soft multiprocessor on the Xilinx Virtex-II Pro FPGA. The

soft multiprocessor solution is evaluated with respect to an implementation on the Intel IXP2800 network processor. In the second part of this study, we develop a design space exploration framework to explore efficient multiprocessor configurations for a target application. We construct analytical models of the architecture and application and solve the exploration problem using Integer Linear Programming (ILP).

2. EXPERIMENTAL STUDY: IPV4 PACKET FORWARDING ON A SOFT MULTIPROCESSOR

Before evaluating the viability of soft multiprocessors, we present the design process and trade-offs involved in harnessing performance from a soft multiprocessor system. In the following sections, we describe our soft multiprocessor design for a router that forwards IPv4 packets.

2.1. Soft Multiprocessor Systems on Xilinx FPGAs

We implement our packet forwarder on a Xilinx Virtex-II Pro 2VP50 FPGA using the Xilinx Embedded Development Kit (EDK) [1]. The 2VP50 consists of 23,616 slices and 522 KB of on-chip BlockRAM memory. The building block of the multiprocessor system is the Xilinx MicroBlaze soft processor IP. The MicroBlaze processor occupies approximately 450 slices (2% of the 2VP50 FPGA area). The soft multiprocessor is a network composed of multiple soft MicroBlaze cores, the peripherals in the fabric, the dual IBM PowerPC 405 cores, and the distributed BlockRAM memories on the chip. The multiprocessor network is supported by two kinds of communication links: the IBM CoreConnect buses and point-to-point FIFOs. The CoreConnect buses for the MicroBlaze include a bus to access local instruction and data memories, and the On-chip Peripheral Bus (OPB) for shared memories and peripherals. The CoreConnect Processor Local Bus (PLB) services the PowerPC cores. The point-to-point Fast Simplex Links (FSL) are unidirectional FIFOs. The multiprocessor system is clocked at 100 MHz due to restrictions on the clock rate of the OPB.

2.2. IPv4 Packet Forwarding Application

The IPv4 packet forwarding application runs at the core of network routers and forwards packets to their final destinations. The forwarding decision at a router consists of finding the next-hop router address and the egress port to which the packet should be sent. The decision depends only on the contents of the IP header. The data plane of the application involves three operations: (i) check whether the input packet is uncorrupted, (ii) find the next hop and egress port using the destination address, and (iii) update the header checksum and time-to-live fields, and forward the packet. Figure 1 illustrates the data plane of the IPv4 forwarding application.
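As an illustration of data-plane steps (i) and (iii), the sketch below verifies the IPv4 header checksum, decrements the TTL, and recomputes the checksum. This is plain Python for exposition only, not the MicroBlaze code used in the actual design; the one's-complement checksum follows the standard IPv4 definition.

```python
import struct

def checksum16(header: bytes) -> int:
    """One's-complement sum over the 16-bit words of an IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def forward_header(header: bytearray):
    """Data-plane steps (i) and (iii): verify, then decrement TTL and fix checksum."""
    if checksum16(bytes(header)) != 0:       # a valid header checksums to zero
        return None                          # drop corrupted packet
    if header[8] <= 1:                       # TTL is byte 8 of the IPv4 header
        return None                          # TTL expired
    header[8] -= 1
    header[10:12] = b"\x00\x00"              # zero the checksum field (bytes 10-11)
    header[10:12] = struct.pack("!H", checksum16(bytes(header)))
    return header
```

Step (ii), the next-hop lookup, is deliberately omitted here; it is the dominant cost and is discussed separately below.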

Fig. 1. Data plane of the IPv4 packet forwarding application.

To handle gigabit rates, routers must be able to forward millions of packets per second. The next-hop lookup is the most intensive data plane operation. The address lookup requires searching the forwarding table for the longest prefix that matches the packet destination address. A natural way to represent prefixes is a tree-based data structure (called a trie) that uses the bits of the prefix to direct branching. There are many variations on the basic trie scheme that trade off the memory requirements of the trie table against the number of memory accesses required for a lookup [4].

We design a soft multiprocessor for the data plane of the IPv4 packet forwarding application. The address lookup operation uses a fixed-stride multi-bit trie. The stride is the number of bits inspected at each step of the prefix match algorithm [4]. The stride order is (12 4 4 4 4 4): the first-level stride inspects 12 bits of the IP address and subsequent strides inspect 4 bits at a time, requiring a maximum of 6 memory accesses for an address lookup. An additional memory access is required to determine the egress port for the matched prefix. We allocate 300 KB of memory for the route table. This can accommodate medium-sized route tables with around 5000 entries, suitable for campus routers or DSL multiplexers. In these cases, the route table can be stored entirely within the on-chip BlockRAM (BRAM) memory of the Xilinx 2VP50 FPGA.

The design objective is to maximize router throughput. In our experiments, we empirically measure the number of packets processed per second by the multiprocessor design and compute throughput by multiplying this packet rate by the packet size. To model the worst-case scenario for data plane forwarding performance, we make three assumptions: (a) all packet sizes are 64 bytes, the minimum size for an Ethernet frame; (b) all address prefixes in the route table are the full 32 bits in length, so the trie lookup algorithm takes 7 memory accesses to find the next hop; and (c) results of the prefix search algorithm are not cached, so the lookup algorithm must be executed for every packet header. We do not consider control plane processing, such as route table updates and ICMP error messages, since these occur infrequently and hence have negligible impact on core router performance.
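The fixed-stride lookup can be sketched as follows. This pointer-based Python model is only illustrative: the real design packs trie nodes into BRAM tables, and the simplified insert routine below assumes prefixes fall on stride boundaries (general prefixes require controlled prefix expansion [4]).

```python
STRIDES = (12, 4, 4, 4, 4, 4)  # stride order from the paper

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = {}     # stride value -> child TrieNode
        self.next_hop = None   # forwarding result recorded at this depth

def insert(root, prefix: int, length: int, next_hop):
    """Record a next hop for a prefix whose length lies on a stride boundary."""
    node, consumed = root, 0
    for s in STRIDES:
        if consumed >= length:
            break
        key = (prefix >> (32 - consumed - s)) & ((1 << s) - 1)
        node = node.children.setdefault(key, TrieNode())
        consumed += s
    node.next_hop = next_hop

def lookup(root, addr: int):
    """Longest-prefix match: at most 6 node visits, mirroring the
       6 memory accesses of the (12 4 4 4 4 4) scheme."""
    node, consumed, best = root, 0, None
    for s in STRIDES:
        if node.next_hop is not None:
            best = node.next_hop           # remember best match so far
        key = (addr >> (32 - consumed - s)) & ((1 << s) - 1)
        node = node.children.get(key)
        consumed += s
        if node is None:
            return best
    return node.next_hop if node.next_hop is not None else best
```

Each iteration of the lookup loop corresponds to one route table memory access in the hardware design; the egress-port fetch is the seventh access mentioned in assumption (b).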

2.3. Soft Multiprocessor Design for Header Processing

The forwarding data plane (Figure 1) has two components: IPv4 header processing and packet payload transfer. We first describe the construction of a soft multiprocessor system for header processing. Figure 2 shows our final multiprocessor design. The micro-architecture consists of multiple arrays of pipelined MicroBlaze processors.

[Figure 2: four pipelined branches of Verify, Lookup Stage 1 and Lookup Stage 2 MicroBlaze processors connected by FSLs; source MicroBlazes feed the branches, route table BRAMs are shared over OPB buses, and the PowerPC cores and GEMAC ports handle the payload path over the PLB and OCM buses.]
Fig. 2. Soft multiprocessor system for the data plane of the IPv4 packet forwarding application.

We briefly summarize our insights in arriving at this particular design. A starting reference for baseline performance is a single-processor solution, in which the entire header processing runs on one MicroBlaze. The route table is stored in BRAM and accessed over the On-chip Peripheral Bus (OPB). Under this scenario, IPv4 forwarding requires 270 cycles per packet. The maximum throughput that can be achieved by this single-processor design operating at 100 MHz is 0.17 Gbps.

As a first step towards a multiprocessor design, we pipeline the header processing. Each branch of the header processing micro-architecture in Figure 2 is a pipelined array of three MicroBlaze processors along which a single header is processed. FSL links transfer the entire header between processors. The first pipeline stage performs IP header verification. The 6 lookup memory accesses (for stride order 12 4 4 4 4 4) of the trie lookup algorithm are partitioned equally among the second and third pipeline stages, and hence can be performed in parallel. The third pipeline stage performs an additional memory access to determine the egress port. The trie table is divided between multiple BRAM modules, and each processor accesses route table memory over a separate OPB bus. For the application decomposition in Figure 2, the throughput of a single array is around 0.5 Gbps.

Pipelining is a means to parallelize the application temporally. The next degree of parallelism comes from replicating the pipeline arrays in space. Each header constitutes

a logically independent control flow. Hence, multiple branches can process different headers in parallel. Each branch executes the same decomposition of the header processing application. Two factors restrict the number of branches in the design: (a) BRAM memory constraints on the FPGA bound the number of processors (with a 300 KB route table and 8 KB of local memory per processor, the Virtex-II Pro 2VP50 FPGA can accommodate only 15-20 processors), and (b) branch executions are not independent due to concurrent memory accesses to the route table over a shared bus.

Taking area and arbitration constraints into account, the final multiprocessor design for header processing (Figure 2) replicates the single pipeline array into 4 branches. All processors in lookup stages 1 and 2 access the same part of the route table in shared memory over the OPB bus. From experiments, there is a significant drop in OPB performance if more than 2 processors share the same bus. The BRAM memory is dual-ported, so the same route table memory can be serviced by 2 OPB buses. Thus, the choice of 4 branches is optimal for multiprocessor designs where shared resources are accessed over the OPB. The measured throughput of the header processing multiprocessor in Figure 2 is 1.8 Gbps. This is less than 4 times the throughput of a single pipelined array (measured to be 0.5 Gbps); the difference is due to the overhead of multiple processors accessing the shared route table memory over the OPB.

2.4. Performance Characteristics of the Soft Multiprocessor for Header Processing

Table 1 shows the breakup of the number of instructions and cycles executed by each pipeline stage of the multiprocessor for header processing in Figure 2. The two IP lookup stages are the bottlenecks in the design. Table 2 summarizes the area, memory and performance of the multiprocessor for header processing in Figure 2. Area utilization is less than 50%, but memory is a tighter constraint.
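The cycle counts in Table 1 let us cross-check the measured rates with back-of-the-envelope arithmetic, using the 100 MHz clock (Section 2.1) and 64-byte packets (Section 2.2); the slowest stage bounds each branch, and four branches give the ideal aggregate.

```python
CLOCK_HZ = 100e6                 # system clock (Section 2.1)
PACKET_BITS = 64 * 8             # minimum-size Ethernet frame (Section 2.2)
stage_cycles = {"verify": 97, "lookup1": 110, "lookup2": 114}   # Table 1

bottleneck = max(stage_cycles.values())   # pipeline rate is set by the slowest stage
branch_gbps = CLOCK_HZ / bottleneck * PACKET_BITS / 1e9
print(f"per-branch throughput: {branch_gbps:.2f} Gbps")        # ~0.45, i.e. "around 0.5 Gbps"
print(f"4-branch ideal (no bus contention): {4 * branch_gbps:.2f} Gbps")
```

The ideal 4-branch figure is close to the measured 1.8 Gbps; in practice, contention on the shared OPB keeps the measured aggregate at, rather than above, this estimate.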
The local memories occupy 14 × 8 = 112 KB, and the routing table occupies 300 KB. The throughput of our router in Figure 2 is 1.8 Gbps.

Stage            # Instructions   # Execution Cycles
Verify           64               97
Lookup Stage 1   57               110
Lookup Stage 2   56               114

Table 1. Execution times for processing one packet header.

# Processors            14 (MicroBlaze)
Area                    11,250 slices (out of 23,616 on the 2VP50), 48% utilization
Memory (on-chip BRAM)   454 KB (out of 522 KB), 87% utilization (major components are the 300 KB route table and 8 KB of instruction+data memory per processor)
Throughput              1.8 Gbps

Table 2. Design characteristics of the soft multiprocessor for header processing on the Xilinx Virtex-II Pro 2VP50.

2.5. Payload Transfer in the Multiprocessor Design

Header processing determines the router forwarding rate. In this section we complete our multiprocessor design for packet forwarding with a mechanism for payload transfer between source and destination ports. The multiprocessor design in Figure 2 shows the payload transfer component and its interface to the multiprocessor for header processing for a 2-port 2 Gbps router. A Gigabit Ethernet MAC (GEMAC) for each port handles packet reception and transmission under the control of the PowerPC processors. The GEMACs transfer the packet header and payload to BRAM memory over the Processor Local Bus (PLB). The header and a pointer to the payload location are then transferred over the On-Chip Memory (OCM) bus into memory that is shared between the PowerPC and the header processing multiprocessor. There is one source MicroBlaze processor per router port, which reads the header from the OCM, transfers the header to the MicroBlaze array, and writes the processed header back into the OCM. Each packet is transferred over the PLB twice, once during reception and once during transmission. The PLB has simultaneous read and write data paths with a total bandwidth of 12.8 Gbps, which is sufficient to buffer and transfer the packet payload at 2 Gbps line rates.

3. EVALUATION OF SOFT MULTIPROCESSOR SOLUTIONS

We evaluate soft multiprocessor systems based on our experimental study of the IPv4 forwarding application. We compare the performance of our soft multiprocessor solution to a software implementation on the Intel IXP2800 network processor. The IXP2800 is a state-of-the-art multiprocessor specialized for packet forwarding applications. It has 16 RISC micro-engines clocked at 1.4 GHz for data plane operations and an Intel XScale processor for control and management plane operations. Meng et al. report a throughput of 10 Gbps on the IXP2800 for the packet forwarding application for different packet sizes [5]. In order to reliably compare performance between the soft multiprocessor and network processor solutions, we normalize throughput with respect to area utilization.
We estimate the total area of the Xilinx Virtex-II Pro 2VP50 FPGA device to be approximately 200 mm². The area utilization of an FPGA design is measured by the number of slices consumed. The header processing subsystem occupies 11,250 slices on the FPGA (Table 2). With the payload processing subsystem and the Gigabit Ethernet MACs in place, we estimate the area of the soft multiprocessor system to be 15,000 slices. This is 63.5% of the total slices on the 2VP50, or around 130 mm² of total area.
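The normalized comparison reduces to simple arithmetic: since both parts are fabricated in a 0.13 µm process, the λ² technology factor cancels and the ratio is effectively throughput per unit area. The sketch below reproduces the 2.6X figure from the areas and throughputs above.

```python
lam = 0.13                                   # process feature size in µm (both parts)
soft_mp = {"area_mm2": 130, "gbps": 1.8}     # this work (estimated area above)
ixp2800 = {"area_mm2": 280, "gbps": 10.0}    # Intel IXP2800 [5]

def norm_throughput(d):
    # technology-normalized throughput T / (λ² · A); with equal λ the
    # comparison reduces to throughput per mm²
    return d["gbps"] / (lam ** 2 * d["area_mm2"])

ratio = norm_throughput(ixp2800) / norm_throughput(soft_mp)
print(f"IXP2800 advantage: {ratio:.1f}x")    # ~2.6x
```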

Table 3 shows the relative performance of the IXP2800 and soft multiprocessor solutions for IPv4 packet forwarding. The IXP2800 performs about 2.6X better than the soft multiprocessor for packet forwarding in terms of normalized throughput. This is because the IXP2800 was specifically designed to target forwarding applications.

                            Soft Multiprocessor   IXP2800
Technology (λ, µm)          0.13                  0.13
Clock Frequency (MHz)       100                   1400
Area (A, mm²)               130                   280
Throughput (T, Gbps)        1.8                   10
Norm. Throughput (T/λ²A)    1                     2.6

Table 3. Performance results for the data plane of the IPv4 packet forwarding application.

However, the advantage of soft multiprocessors is evident when we consider the performance-cost trade-off in application deployment. The cost of deploying an application on a target platform has two components: (a) the non-recurring development cost, and (b) the recurring per-part cost. The per-part cost of both the IXP2800 and the Xilinx Virtex-II Pro 2VP50 FPGA used in our study is around $1000. Typically, the per-part cost of FPGAs is greater than that of other platforms of similar area. However, the development cost of a new platform is in the $20 million range and growing. FPGAs are standardized parts and hence incur zero IC development cost. From our experimental study, a soft multiprocessor implementation lost only 2.6X in performance compared to an application-specific programmable platform implementation. If no high performance platform exists for an application, it is not always possible to meet the prohibitive cost or market deadline of a new design. In such cases, the platform can be constructed on an FPGA for a modest loss in performance. Soft multiprocessor systems allow quick and cost-effective deployment for many applications, while obviating the risks of producing silicon. One important consequence of the low development cost is that soft multiprocessors can be used as prototypes for new platform designs.

4. FRAMEWORK FOR ARCHITECTURE EXPLORATION

In Section 2, we presented a hand-tuned soft multiprocessor design for packet forwarding and showed that it is only a factor of 2.6X slower than a network processor implementation. However, as the number of processors that can fit on an FPGA increases, the design effort to determine an efficient multiprocessor configuration becomes more labor intensive. Projections from Tensilica Inc. [6] forecast that embedded systems will soon be composed of over 100 processors on a single chip to guarantee acceptable performance. To ease the task of the designer, we present a framework to explore the


design space of soft multiprocessor micro-architectures. At the core of our exploration framework is Integer Linear Programming (ILP). In recent years, ILP solvers have advanced significantly [7], and many large problems can now be routinely solved. Further, ILP is flexible and can be easily adapted to different problem restrictions.

In our exploration framework, we explore the design space of array architectures shown in Figure 3. An array architecture can have multiple pipeline stages. A pipeline stage is a vertical column of processors, and each stage can have a different number of processors. All processors in a stage perform the same set of tasks. Every processor in a stage receives inputs from the previous stage and transmits outputs to the next stage. To explore this design space, we first determine a set of partitionings of the application onto the processors. For each partitioning, ILP is used to determine the best multiprocessor configuration. The best design among these partitionings is then synthesized to verify performance. In the following subsections, we detail these steps.


[Figure 3: a grid of MicroBlaze processors, with the number of pipeline stages along one axis and the number of parallel processors in each stage along the other.]

Fig. 3. Design space of array architectures.

4.1. Application Partitioning

The application is represented as a data flow graph. When we partition the application, we only consider partitionings that are ordered according to the data flow graph. This allows us to map each partition onto a single pipeline stage of the array architecture. We cluster application tasks to decrease the number of partitionings in large data flow graphs. The designer can trade off the time and accuracy of the exploration by varying the size of the clusters. All valid partitionings are automatically extracted from the clustered data flow graph. For the IPv4 packet forwarding application, we manually divided the data flow graph into 9 clusters, from which more than 2000 valid partitionings are automatically extracted.

4.2. ILP Formulation

Once all application partitionings are determined, we use ILP to find the best array architecture for each partitioning. The inputs to the ILP formulation are: (a) an application partitioning, (b) profile data for worst-case task execution times and memory requirements, and (c) hardware resource constraints. Several simplifications are made to ease the ILP formulation. First, we assume sufficient resources are available for communication between pipeline stages. Second, we translate resource constraints into constraints on the number of MicroBlaze processors. The exact number of processors that the FPGA can support is difficult to determine; hence, we evaluate the ILP multiple times with different constraints on the number of processors.

The ILP formulation treats the array architecture exploration problem as a flow problem. It models a processor as a node with a flow rate and tries to maximize the overall throughput. The formulation is presented below.

Parameters:
  S    :  set of pipeline stages
  J    :  set of architecture constraints
  A    :  coefficients of the architecture constraints
  b    :  bounds on the architecture constraints
  t_i  :  throughput of a single processor in stage i, i ∈ S

Variables:
  T_i  :  throughput of pipeline stage i, i ∈ S;  T = (T_1, T_2, ..., T_|S|)
  p_i  :  number of processors in stage i, i ∈ S;  p = (p_1, p_2, ..., p_|S|)
  φ    :  overall architecture throughput

Maximize φ subject to:
  T_i = t_i · p_i,   ∀i ∈ S
  φ ≤ T_i,           ∀i ∈ S
  A p ≤ b            (architecture constraints)
  A ∈ R^(|J|×|S|),  b ∈ R^|J|,  T ∈ R_+^|S|,  p ∈ Z_+^|S|

In the formulation, the flow rate t_i for a single processor in stage i, i ∈ S, is the throughput achieved if a single processor were to execute the tasks assigned to stage i. Since every processor in a stage executes the same set of tasks, the total throughput T_i for stage i is set to t_i · p_i. The overall throughput φ is equal to the minimum throughput across all stages; this is encoded as φ ≤ T_i, ∀i ∈ S. Architecture constraints reflect FPGA hardware limitations; for example, they include limits on the number of processors and on-chip memory capacity. To make the solution meaningful, the number of processors in every stage must be an integer; without this integrality restriction, the problem would be a simple linear program. Finally, the objective is to maximize the overall throughput φ.
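For intuition, the flow model can be solved for small instances by exhaustive search. The sketch below is a stand-in for the lpsolve-based ILP used in the actual framework, keeping only a total processor budget as the architecture constraint; the stage rates are hypothetical, chosen so that the first stage runs twice as fast as the others.

```python
from itertools import product

def best_allocation(t, max_procs):
    """Maximize phi = min_i t_i * p_i subject to sum(p_i) <= max_procs and
       p_i >= 1 integer -- the flow model above with a single budget
       constraint, solved by brute force instead of an ILP solver."""
    best_phi, best_p = 0.0, None
    for p in product(range(1, max_procs + 1), repeat=len(t)):
        if sum(p) > max_procs:
            continue                                  # violates the budget
        phi = min(ti * pi for ti, pi in zip(t, p))    # stage rates T_i = t_i * p_i
        if phi > best_phi:
            best_phi, best_p = phi, p
    return best_phi, best_p

# Hypothetical per-processor stage rates: stage 1 is twice as fast as stages 2 and 3.
phi, p = best_allocation((2.0, 1.0, 1.0), max_procs=10)
print(phi, p)   # -> 4.0 (2, 4, 4): half as many processors in the fast stage
```

With a budget of 10 processors the search returns a 2-4-4 allocation, giving the fast first stage half as many processors as the slower stages; this mirrors the shape of the automated exploration result reported in Section 4.3.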

4.3. Exploration Results

We use the lpsolve ILP solver [8] in our exploration framework. We select the best design based on the ILP results and synthesize it to verify performance. If the verification fails, we select the next best design and repeat the process.

Figure 4 shows the multiprocessor solution for header processing after the exploration. It contains 3 pipeline stages, with 2 processors in the first stage and 4 processors in each of the next 2 stages. The IP address lookup involves a total of 7 memory accesses: the first stage performs a single access, and the second and third stages each perform 3 accesses. The verify operations are divided between the last 2 stages. The processors in the first pipeline stage process packets at twice the rate of the latter stages; hence, only half as many processors are needed in this stage. The resulting design balances the workload across all the processors extremely well. In comparison, the hand-tuned multiprocessor design in Figure 2 is less balanced: the verify stage is slightly underutilized compared to the latter stages, as seen in Table 1. Consequently, the new design achieves a better throughput of 1.9 Gbps, surpassing the 1.8 Gbps throughput of the hand-tuned design, while using fewer processors.

[Figure 4: a 2-4-4 array of MicroBlaze processors running the Lookup1, Lookup2, Lookup3, Verify (version & TTL) and Verify (checksum) tasks, connected by FSLs, with route table BRAMs accessed over OPB buses; headers enter from and return to the source MicroBlazes.]

Fig. 4. Multiprocessor design solution for header processing after automated exploration.

5. CONCLUSIONS

In this paper, we evaluated the effectiveness of FPGA-based soft multiprocessors for high performance applications. We designed a soft multiprocessor for the data plane of the IPv4 packet forwarding application and achieved a throughput of 1.8 Gbps. We also developed a design space exploration framework for soft multiprocessor micro-architectures. Using this framework, we designed a more efficient multiprocessor that achieved a throughput of 1.9 Gbps, surpassing the performance of our hand-tuned design.

From our study, soft multiprocessors on FPGAs lose only a 2.6X factor in performance, normalized to area, compared to a network processor implementation of the IPv4 packet forwarding application. If a high-performance programmable platform already exists for an application niche, then it is a cost-effective implementation medium. But if such a part is not available, is it worth $20M to design and manufacture a new IC for this 2-4X performance gain? If not, the FPGA is a viable low-cost platform for implementing the same application in software.

6. ACKNOWLEDGMENTS

We thank Akash Deshpande of Teja Systems for suggesting the investigation of soft multiprocessor systems. We also thank André DeHon for his guidance and comments.

7. REFERENCES

[1] Embedded Systems Tools Guide, Xilinx Embedded Development Kit, EDK version 6.2i ed., Xilinx, Inc., June 2004.

[2] H. H. Jones, International Business Strategies Inc., private communication. (cf. "How to Slow the Design Cost Spiral," Electronics Design Chain, Volume 1, Summer 2002.)

[3] F. Baker, Requirements for IP Version 4 Routers, Request for Comments RFC-1812 ed., Network Working Group, June 1995.

[4] M. Ruiz-Sánchez, E. Biersack, and W. Dabbous, "Survey and Taxonomy of IP Address Lookup Algorithms," IEEE Network, Vol. 15, No. 2, pp. 8-23, March-April 2001.

[5] D. Meng, R. Gunturi, and M. Castelino, "IXP2800 Intel Network Processor IP Forwarding Benchmark Full Disclosure Report for OC192-POS," Intel Corporation, Tech. Rep., October 2003, as reported to the Network Processing Forum (NPF).

[6] C. Rowen, Tensilica Inc., "Fundamental Change in MPSoCs: A fifteen year outlook," in MPSOC'03 Workshop Proceedings, International Seminar on Application-Specific Multi-Processor SoC, 2003.

[7] A. Atamtürk and M. W. Savelsbergh, "Integer Programming Software Systems," IEOR, University of California at Berkeley, Tech. Rep. BCOL.03.01, January 2003.

[8] M. Berkelaar et al., "lpSolve version 1.1.9: Interface to Lp solve version 5 to solve linear and integer programs," April 2005, URL: http://cran.rproject.org/src/contrib/Descriptions/lpSolve.html.