EVALUATION OF ON-CHIP MULTIPROCESSOR ARCHITECTURES FOR AN EMBEDDED CARTOGRAPHIC SYSTEM

ALESSIO BECHINI and COSIMO ANTONIO PRETE
Dipartimento di Ingegneria dell’Informazione, Facoltà di Ingegneria, Università di Pisa
via Diotisalvi, 2, 56100 Pisa, Italy
{a.bechini,prete}@iet.unipi.it

ABSTRACT

Embedded systems with complex graphical interfaces require significant computational power. Moreover, low power consumption and low cost are usually strict specification constraints. One way to address these conflicting needs is to adopt a simple multiprocessor on a single chip, built from low-cost CPU cores. In this paper, we consider a cartographic system to be deployed on hand-held devices, and we present the methodology used for designing the multiprocessor architecture of its hardware platform. Whenever large chip production volumes are involved, the multiprocessor can be specialized to meet the software requirements of embedded applications. The proposed design process is based on the following steps: workload definition; definition of a pool of eligible architectures; simulation of the software workload; comparison and analysis of simulation results. In this scenario, trace-driven simulations are aimed at evaluating the performance of time-critical paths of typical user activities. The results are used to tune the architecture, determining several architecture parameters (such as the number of CPU cores, the number and width of internal buses, and the cache parameters). The outcome of the design process for the specific system considered in this paper is an architecture with ARM cores, able to support cartographic applications at low cost and low power consumption.

Keywords: Performance Evaluation; Trace-driven Simulation; Multiprocessor Architectures; Embedded Systems; On-Chip Multiprocessors.

1 INTRODUCTION

Embedded systems require increasing computational power and, at the same time, have to keep costs low. Moreover, low power consumption may be required on hand-held devices [1]. A possible way to obtain these features lies in the adoption of multiprocessors on a single chip [2] [3], reusing low-cost and low-power processor cells already present on the market. Whenever large production volumes of the chip are involved, the multiprocessor can be specialized to meet the requirements of the dedicated software application. The goal of this paper is to present the methodology we have used for selecting and tuning the multiprocessor architecture for a cartographic system, in the framework of the Esprit Project “SPP”. We consider cartographic systems to be deployed on hand-held devices with an LCD display, supporting GPS as well. The design process takes into account the specific behavior of cartographic software, in terms of use of system resources (CPU, memory, LCD, other peripheral devices). In this context, we selected an architecture with two ARM 710 cores. ARM is a 32-bit microprocessor that uses RISC technology and a fully static design approach to obtain both high performance and very low power consumption.

[Figure 1: flow diagram with the blocks Workload Definition, Architecture Definition, Trace-Driven Simulation, Simulator Upgrading, Result Evaluation, and Selected Architecture.]

Figure 1 – Graphical description of the design methodology. The first problem to be addressed is the workload definition, i.e. the characterization of the software activity and the selection of the input data responsible for the heaviest and an average computational load. The steps must be repeated until the simulation results comply with the specifications on the system performance. The feedback paths (coming from hints for subsequent improvements) are shown with thin black arrows.

Low cost, low power consumption, low-voltage operation, high performance, and compact design make ARM suitable for embedded cartographic systems. The performance evaluation required in the design process is carried out by means of trace-driven simulations [4] [5] [6]. The design issues in the architecture development are: i) the benefits and characteristics of a caching system, ii) the possible choices for the interconnection network among CPUs, caches, and main memory, iii) the influence on the bus of the video traffic due to the LCD controller. The simulations carried out during the whole design process took into account the major computational activities of the system, i.e. chart plotting and GPS services. We have also tackled other problems, such as the choice of a low-level mechanism for synchronization and mutual exclusion, the way to guarantee coherence of data within caches, and the implementation of inter-processor interrupts. Straightforward and efficient solutions to these problems are required by any simple operating system whose task is the exploitation and management of the chipset. However, as they are not crucial in the design process, we do not discuss them extensively here.

[Figure 2: bubble chart over the scanned chart area; horizontal axis: longitude; vertical axis: latitude; legend: automatic grid, chosen points.]

Figure 2 – Evaluation of computational load in map redrawing. The plotting time on the LCD display depends on the map portion to show and on the map complexity. The area of each bubble in the diagram is proportional to the plotting time in the corresponding position on the map.

2 METHODOLOGY

The design of a chip for an embedded cartographic system is based on the performance evaluation of time-critical paths of typical user activity [7]. This kind of performance analysis can be applied to a number of different possible architectures. The design process includes some basic steps:

1. Workload definition, pointing out both the relevant software activities and their dependence on the input data.
2. Definition of a pool of eligible architectures.
3. Simulation of the software activity over each architecture in the pool.
4. Comparison and analysis of simulation results, and selection of an architecture in the pool, according to given criteria.

The performance evaluation is based on trace-driven techniques [4]. The previous steps require additional side activities: tuning the system simulator and preparing the actual workload for the simulator. Trace-driven simulation requires a set of traces of memory operations describing the actions performed by each processor (CPU core) in the architecture [6]. The traces have to describe accurately the software activity selected in the workload. Unfortunately, the available software is often plainly sequential; therefore, after the traces have been collected, an analysis and adjustment phase takes care of dispatching them among the cores [8]. Moreover, the simulator may not be able to deal properly with an eligible architecture. In that case it has to be upgraded, so that it can model all the architectures in the selected pool. Hints for this kind of operation come from the inspection of simulation results. Figure 1 illustrates how the different activities in the design methodology connect to each other. The feedback paths represent the hints, coming from the inspection and analysis of results, for refining both the workload and the architecture pool, and for updating the simulator.
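The dispatching of a sequential trace among the cores can be illustrated with a small sketch. The paper does not describe how the traces are split, so everything here is an assumption made for illustration: the `MemOp` record layout, the name `dispatch_segments`, the premise that the sequential trace has already been cut into independent segments, and the greedy longest-first policy.

```python
from typing import List, Tuple

# Hypothetical trace record: (kind, address), where kind is
# "I" (instruction fetch), "R" (data read) or "W" (data write).
MemOp = Tuple[str, int]


def dispatch_segments(segments: List[List[MemOp]], n_cores: int) -> List[List[MemOp]]:
    """Assign trace segments to cores, longest segment first.

    Each segment goes to the core that currently holds the fewest
    memory operations. Dependencies between segments are ignored
    (an assumption of this sketch).
    """
    per_core: List[List[MemOp]] = [[] for _ in range(n_cores)]
    for segment in sorted(segments, key=len, reverse=True):
        target = min(range(n_cores), key=lambda core: len(per_core[core]))
        per_core[target].extend(segment)
    return per_core
```

With two cores and segments of 5, 3, 2 and 2 operations, this policy yields per-core traces of 7 and 5 operations. A real adjustment phase would also have to respect the data dependencies of the original program.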

3 WORKLOAD CHARACTERIZATION

The main time-critical activity for the system is map plotting on an LCD screen. It has to be handled carefully, because it directly affects the user of the hand-held device. At the same time, GPS algorithms are executed cyclically to update the current geographical position of the device. Once the workload has been roughly defined, it has to be characterized more precisely by investigating the input domain. Specifically, we have to find the data causing the heaviest computational load. In our study, we used software currently employed in cartographic plotters, which exploits two libraries with high- and low-level graphics functions. A quick preliminary analysis of this software shows that the execution time of map redrawing depends on the specific portion of the map (stored on a peripheral device called “C-card”) to show on the LCD display.

Particular attention must be devoted to choosing the map portion that yields the longest execution time, in order to take it as the worst input case for the system. Moreover, a map of medium complexity can be used to represent the average operating conditions of the system.

3.1 Looking for test maps

The worst-case map portion was searched for in the chart of the Southern Norway coastlines, because of the complexity coming from the richness of the represented features, both geographical (islands, fjords, mountains, etc.) and symbolic (bridges, lighthouses, gas stations, etc.). An automatic procedure for scanning the whole chart has been implemented, in order to find the map portion that yields the longest plotting time. Note that a chart has to be scanned by changing the geographical point displayed at the center of the screen, and then varying the zoom factor within the allowed zooming range. Figure 2 shows some sample results of this scanning procedure at different points on the chart. The worst-case map portion has been used in the trace generation as input data for the plotting procedure. A map of Elba Island, in Italy, was chosen as representative of a medium-complexity case.
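The scanning procedure can be sketched as an exhaustive search over screen centers and zoom factors. This is a minimal illustration under stated assumptions, not the procedure actually implemented for the project: `find_worst_case` is a hypothetical name, and the `plot_time` callback stands in for a measured redraw time of the plotting software.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

View = Tuple[int, int, int]  # (longitude, latitude, zoom), units are illustrative


def find_worst_case(
    longitudes: Iterable[int],
    latitudes: Iterable[int],
    zooms: Iterable[int],
    plot_time: Callable[[int, int, int], float],
) -> Tuple[float, Optional[View]]:
    """Return the longest plotting time and the view that produces it.

    Every combination of screen center and zoom factor is tried once.
    If the grid is empty, the view is None.
    """
    worst_time, worst_view = float("-inf"), None
    for lon, lat, zoom in product(longitudes, latitudes, zooms):
        t = plot_time(lon, lat, zoom)
        if t > worst_time:
            worst_time, worst_view = t, (lon, lat, zoom)
    return worst_time, worst_view
```

The cost of the scan grows with the product of the three grid sizes, which is one reason to sample the chart on a grid rather than at every possible center.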

3.2 GPS workload

The GPS computation was evaluated using a set of standard algorithms [9]. Various operating conditions have been tested, including cases of re-acquisition after satellite signal loss. The corresponding traces show that GPS activity does not need significant processing power. Nevertheless, the worst-case GPS computation has been included in the system workload, considering that the most time-consuming GPS procedure would be completely overlapped with the map redrawing process.

Table I – Pool of eligible architectures for the SPP chipset

ID   Description
a1   No cache (used only for comparison purposes)
a2   Single cache (16 – 32 kBytes), shared among cores
a3   One private cache (4 – 8 kBytes) for each core
a4   One private instruction cache (8 kBytes) for each core
a5   Private instruction caches (4 kBytes), and a generic shared cache (4 – 16 kBytes)
a6   Multiple caches (4 – 8 kBytes) operating on different address spaces
a7   Private instruction caches (4 – 8 kBytes), and a single data cache (4 – 8 kBytes) shared among cores

4 A POOL OF ELIGIBLE ARCHITECTURES

The next step of the design process consists in selecting a number of different architectures, in order to find among them the one that best fits the product requirements. The better suited the selected architectures are to supporting the cartographic software efficiently, the more effective the simulation and comparison steps will be. The following observations help in populating a pool of eligible architectures.

4.1 Symmetric vs. asymmetric architectures

Considering that cartographic applications are continuously evolving, the lack of flexibility hampers the adoption of an asymmetric architecture. On the other hand, a solution with anonymous processors is better suited to hosting updated versions of the cartographic applications. Even if an asymmetric architecture could yield impressive performance, it is ruled out because of maintenance and ease-of-deployment issues.

4.2 Bus width and number of buses

An important design choice is the bus width. A wide bus is worth adopting only if it gives considerably better performance than narrower ones. Figure 3 shows how the bus width influences redraw timings in the case of the SPP chipset. The bus widths analyzed are 16 bits, 32 bits, and 64 bits. It is immediately clear that the adoption of a 64-bit bus does not yield any special payoff. The main advantage of choosing a 16-bit bus is the silicon saving. On the other hand, a 32-bit bus gives a higher tolerance to the LCD refreshing traffic, a simpler interface with the ARM cores, and better performance (12% more than the 16-bit bus).

[Figure 3: bar chart; horizontal axis: Bus width (bits), with values 16, 32, 64; vertical axis: Redraw time (seconds), from 0 to 0.9.]

Figure 3 – Impact of bus width on map redraw time. The modest performance improvement given by a 64-bit bus does not justify its adoption. The results in this figure are obtained by simulating the execution of high-level graphic functions over an architecture with 4-kByte private caches, considering a video traffic (LCD 512x384) of 25 Mpixel/sec, 60 Hz. The access time for the burst-access RAM is assumed to be 100 ns for the first word, and 28 ns for the following sequential ones.

[Figure 4: two bar charts, "Redraw time for a typical map (Elba Island)" and "Redraw time for a worst-case map (Norway Coastlines)"; horizontal axis: Architectures, with categories a1, a2 (16k), a2 (32k), a3 (4k), a3 (8k), a4 (8k), a5 (4k-4k), a5 (4k-8k), a5 (4k-16k), a6 (4k), a6 (8k), a7 (4k-4k), a7 (4k-8k), a7 (8k-4k); vertical axis: Redraw time (seconds); one bar carries the value label 11.55.]

Figure 4 – Redraw time in the average and worst case for each architecture in the studied pool. Results for architecture “a1” (no cache) are shown only to visualize the performance improvement due to the adoption of a caching scheme. The two diagrams have a similar shape, apart from the scale factor. In this particular situation, the quickest redraw time is obtained by the same architecture in both the average and the worst input case. The results presented here are obtained for a video memory page of 512x384 pixels with no graphic engine (software low-level graphics). The video traffic is assumed to be 25 Mpixel/sec, 60 Hz.

Whenever a single internal bus is used in the system, the traffic due to LCD refreshing overlaps the traffic towards main memory. This might cause bus congestion. Thus, to avoid the time overhead due to bus contention, we could also take into consideration architectures with a dedicated on-chip bus, i.e. a separate bus designed mainly to support the LCD traffic. However, preliminary simulations show that such bus splitting does not yield a significant performance improvement. The speedup on plotting time obtained in this way is less than 4% in every architecture of interest, while the cost in terms of chip die size becomes unacceptable. Therefore, even though we initially deemed this kind of architecture eligible, we later discarded it.

From the observations presented in this section, we can see that, for the selected workload, the bus system does not represent a real bottleneck, neither for the number of buses nor for their width. We can thus try to improve the overall performance by designing proper caching schemes.
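The effect of the bus width on the cost of a single cache miss can be worked out from the burst-access timing assumed for Figure 3 (100 ns for the first word, 28 ns for each following one). The sketch below is a back-of-the-envelope model only: it assumes that a "word" is one bus-wide transfer and that a whole cache block is fetched in one burst. The function name and the 16-byte block in the usage note are illustrative choices.

```python
def block_fill_ns(
    block_bytes: int,
    bus_bits: int,
    first_ns: float = 100.0,
    next_ns: float = 28.0,
) -> float:
    """Time to fetch one cache block from burst-access RAM, in nanoseconds.

    Assumes one bus-wide transfer per "word": the first transfer costs
    first_ns, every following sequential transfer costs next_ns.
    """
    bus_bytes = bus_bits // 8
    transfers = (block_bytes + bus_bytes - 1) // bus_bytes  # ceiling division
    return first_ns + (transfers - 1) * next_ns
```

Under these assumptions a 16-byte block takes 296 ns on a 16-bit bus, 184 ns on a 32-bit bus, and 128 ns on a 64-bit bus. The model covers only the miss penalty; how much of it shows up in the redraw time depends on how often the caches miss, which only the simulations can tell.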

4.3 Searching for possible caches

A crucial role in the architecture design is played by the caching scheme. Caches, in order to be really useful and effective, need to be carefully tuned [10] [5]. Many architectures in the pool differ from each other in the caching scheme; moreover, the same architectural structure can be used for several solutions, differing only in the values of the cache parameters. Table I shows the cache values used in the architecture pool for the SPP chipset. A first result from simulations is that the system cannot meet the specified timings without memory caching. Through simulations, we can determine which software process causes the highest number of cache misses: that process is used for tuning the cache parameters. Considering the speed-up due to the introduction of different types of caches, we can select a range of possible values for the cache parameters. After these considerations, the resulting ranges selected for the SPP chipset are: size: 4-8 kBytes; block size: 8-16 bytes; associativity: 4 ways.

Taking into account the previous observations, we have defined a pool of eligible architectures for the SPP chipset. Table I summarizes it, giving an ID to each architectural solution and specifying the range of cache parameters adopted in each case. The first architecture, “a1”, has no caching scheme; it has been inserted in the pool just as a term of comparison for the other, likely more viable, solutions.
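To make the cache parameters concrete, the following sketch counts hits and misses of a set-associative cache over a trace of addresses, which is the core operation of a trace-driven cache simulation. It is not the simulator used in the project: the class and function names are hypothetical, and LRU replacement is an assumption, since the replacement policy is not stated in the paper.

```python
from collections import OrderedDict
from typing import Iterable, List


class SetAssociativeCache:
    """Set-associative cache model with LRU replacement (LRU is an assumption)."""

    def __init__(self, size_bytes: int, block_bytes: int, ways: int) -> None:
        self.block_bytes = block_bytes
        self.ways = ways
        self.n_sets = size_bytes // (block_bytes * ways)
        # One ordered dict per set: keys are tags, least recently used first.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(self.n_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Simulate one memory reference; return True on a hit."""
        block = address // self.block_bytes
        index, tag = block % self.n_sets, block // self.n_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return False


def miss_ratio(cache: SetAssociativeCache, trace: Iterable[int]) -> float:
    """Feed a trace of addresses to the cache and return misses / references."""
    for address in trace:
        cache.access(address)
    total = cache.hits + cache.misses
    return cache.misses / total if total else 0.0
```

For example, an 8-kByte, 4-way cache with 16-byte blocks has 8192 / (16 x 4) = 128 sets, and so does a 4-kByte, 4-way cache with 8-byte blocks. Running the same trace through several parameter combinations is how the ranges above could be compared.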

5 ARCHITECTURE EVALUATION

Once the workload and the pool of eligible architectures have been defined, trace-driven simulations allow the performance evaluation of cartographic applications over such a pool. This process is aimed at gathering information on performance-related aspects of each eligible solution. In the evaluation of the chip architecture, we also have to consider the architecture of the different systems it will be plugged into. In fact, the access timings of all the devices external to the chip (e.g. RAM, controllers, etc.) affect the overall system performance (even if cache memory usually has a strong decoupling effect). For this reason, we have simulated the behavior of each eligible solution on two different products, employing in the first case (low-end product) cheap RAM and a small LCD display, and in the second case (high-end product) a faster memory and a larger display.

Different parallel architectures can be compared using a variety of performance metrics proposed in the literature [11]. However, especially in embedded systems, the plain execution time is the basic parameter for architecture evaluation. The selection of the most appropriate solution is carried out using some given performance indexes. In our particular case, the simplest performance index is the time spent in redrawing a map on the LCD display. This index can be measured for the plotting of both the most complex and the typical map, according to the worst and average test cases chosen in the previous workload definition phase. Figure 4 shows the redraw time in the average and worst cases for all the architectures studied in the SPP project, as reported in Table I. In this particular situation, the quickest redraw time is obtained by the same architecture for both the average and the most complex map portion.

There are many other issues to take into account in selecting a particular architecture, and most of them have to do with production costs. The corresponding chip area is one of the most important. Even if it is very difficult to predict the exact chip dimensions from the plain architectural scheme, it is possible to estimate them using heuristic methods, starting from the actual cache parameters [10]. The diagram in Figure 5 gives an estimate of the total area needed by cache memory in every studied architecture.

[Figure 5: bar chart "Area occupied by cache memory"; horizontal axis: Architectures, with categories a2 (16k), a2 (32k), a3 (4k), a3 (8k), a4 (8k), a5 (4k-4k), a5 (4k-8k), a5 (4k-16k), a6 (4k), a6 (8k), a7 (4k-4k), a7 (4k-8k), a7 (8k-4k); vertical axis: Total cache area (rbe x 1,000), from 0 to 150.]

Figure 5 – Estimation of the chip area (in rbe) occupied by cache memories in the studied architectures. An rbe corresponds to the area of a one-bit cell. The values in this diagram refer to architectures with two ARM cores, and do not take into account the die size due to interconnections (much larger in architectures with two levels of caches).

[Figure 6: block diagram with the labels Arbiter, IRQ-CTRL, ARM (two cores), MMU + Cache (one per core), Internal peripherals, LCD-CTRL, On-chip Bus, Bus Interface, Chip borders, Out-of-chip Bus, External peripherals, Ram, C-Card, Input.]

Figure 6 – The architecture selected for the SPP chipset. The components placed outside the chip borders may have different response (or access) times, and this issue severely influences the system performance. For this reason, the external components have to be properly modeled within the simulation environment as well. The module called “C-card” is a special-purpose device for storing digital maps.

The architecture actually selected for the SPP chipset is the one named “a3”, with two CPU cores with private caches of 8 kBytes, 4 ways, 16-byte blocks, and copy-back write policy. It has been chosen not only because of its quick redraw time, but also because of its simplicity with respect to other architectures with comparable performance, and its reasonable occupancy in terms of chip die size. Figure 6 shows a schematic view of the architecture selected for the SPP project, distinguishing the components to be implemented inside the chip from the external components (both of which affect the system performance).
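As a rough illustration of what the selected cache parameters imply for storage, the sketch below counts the bits held by one cache: data, tag, and status bits per line. It is only a bit count, not the heuristic area model of [10] and not a reproduction of Figure 5. The 32-bit address width, the two status bits per line (valid and dirty, as a copy-back cache would need), and the function name are assumptions of this sketch.

```python
def cache_storage_bits(
    size_bytes: int,
    block_bytes: int,
    ways: int,
    addr_bits: int = 32,
    status_bits: int = 2,
) -> int:
    """Count data, tag and status bits of a set-associative cache.

    size_bytes, block_bytes and the number of sets must be powers of two.
    status_bits defaults to one valid bit plus one dirty bit (assumed).
    """
    n_lines = size_bytes // block_bytes
    n_sets = n_lines // ways
    offset_bits = block_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = n_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    bits_per_line = block_bytes * 8 + tag_bits + status_bits
    return n_lines * bits_per_line
```

For an 8-kByte, 4-way cache with 16-byte blocks this gives 512 lines of 128 + 21 + 2 = 151 bits, i.e. 77,312 bits per cache. Tags and status bits add about 18% to the raw data bits in this configuration, an overhead that grows as the block size shrinks.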

6 CONCLUSIONS

The market of embedded systems is becoming increasingly important, and for this reason the cost of the final product and its power consumption are crucial issues to be addressed, especially in hand-held devices. In this context, architecture design and tuning play a central role. We have shown how, in the field of cartographic embedded systems, a dedicated architecture can be designed starting from the features of the application software. An initial analysis showed that the processors on the market were not suitable for the SPP project specifications, because of their cost and power consumption. These reasons led to the design of an on-chip multiprocessor based on ARM cores.

The architecture tuning (i.e. the selection of the number of CPU cores, the number and width of internal buses, the cache parameters, etc.) has been done following a methodology based on performance evaluation, making use of trace-driven simulation. In this process, the most complex phase was the definition of the workload in the worst operating condition for the cartographic system. The final result for the SPP chipset is a dual-processor architecture with ARM cores, providing enough computational power for cartographic applications at low cost and low power consumption.

7 ACKNOWLEDGMENTS

The present work has been carried out in the framework of the Esprit Project SPP, “Scalable Peripheral Processor,” contract no. 29173. The project consortium parties are: C-Map, Marina di Carrara, Italy; Alcatel Microelectronics, Zaventem, Belgium; Cetrek, Poole, UK; Centro TEAM, Pisa, Italy; and Dipartimento di Ingegneria dell’Informazione, University of Pisa, Italy. The authors are grateful to Giampaolo Scalone, Paolo Castelletti and Filippo Martinelli, who contributed to the results presented in this paper.

8 REFERENCES

[1] M. Schlett, Trends in Embedded-Microprocessor Design, IEEE Computer, 31(8), 1998, 44-49.
[2] L. Hammond, B. A. Nayfeh, K. Olukotun, A Single-Chip Multiprocessor, IEEE Computer, 30(9), 1997, 79-85.
[3] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, The Case for a Single-Chip Multiprocessor, in Proc. of ASPLOS 1996, 2-11.
[4] C. A. Prete, G. Prina, and L. Ricciardi, A Trace-Driven Simulator for Performance Evaluation of Cache-Based Multiprocessor Systems, IEEE Trans. Parallel and Distributed Systems, 6(9), 1995, 915-929.
[5] C. A. Prete, M. Graziano, and F. Lazzarini, The ChARM Tool for Tuning Embedded Systems, IEEE Micro, 17(4), 1997, 67-76.
[6] R. Giorgi, C. A. Prete, G. Prina, and L. Ricciardi, Trace Factory: Generating Workloads for Trace-Driven Simulation of Shared-Bus Multiprocessors, IEEE Concurrency, 5(4), 1997, 54-68.
[7] B. A. Nayfeh, L. Hammond, K. Olukotun, Evaluation of Design Alternatives for a Multiprocessor Microprocessor, in Proc. of ISCA 1996, 67-77.
[8] L. Hammond, M. Willey, K. Olukotun, Data Speculation Support for a Chip Multiprocessor, in Proc. of ASPLOS 1998, 58-69.
[9] E. D. Kaplan, Understanding GPS: Principles and Applications, Artech House, 1996.
[10] M. J. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett Publishers, 1995.
[11] S. Sahni, V. Thanvantri, Performance Metrics: Keeping the Focus on Runtime, IEEE Parallel and Distributed Technology, 4(1), Spring 1996, 43-56.