2009 15th International Conference on Parallel and Distributed Systems

Multiprocessor System-on-Chip Profiling Architecture: Design and Implementation

Po-Hui Chen1, Chung-Ta King2, Yuan-Ying Chang2, Shau-Yin Tseng3

1 Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan, ROC
2 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, ROC
3 SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan, ROC

[email protected], {king, elmo}@cs.nthu.edu.tw, [email protected]

ABSTRACT

With the growing need for advanced functionality in modern embedded systems, it is now necessary to integrate multiple processors in the system, preferably on a single chip, to support the required computing complexity. The problem is that such multiprocessor system-on-chip (MPSoC) architectures are very complex, and their internal behavior is very difficult to track. An effective tool for profiling the behavior of an MPSoC system is therefore in great demand. Such a tool is very useful during system design for exploring design options and identifying potential bottlenecks. In this paper, we introduce the MultiProcessor Profiling Architecture (MPPA) -- a general framework for profiling MPSoC embedded systems. The MPPA framework entails the use of FPGA emulation for the target system, the embedding of performance counters for recording system events, and the development of OS drivers for collecting the profiled data. To demonstrate its use, we show the implementation of an MPSoC emulation system based on LEON3 cores following the MPPA framework. We also show how the MPPA framework and the emulator help designers identify performance problems and improve their MPSoC embedded system designs.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Performance attributes

General Terms

Measurement, Performance, Design

Keywords

MPSoC, multiprocessor, profiling, architecture, design, monitor

1. INTRODUCTION

With the growing need for advanced functionality in modern embedded systems, it is now necessary to integrate multiple processors in the system, preferably on a single chip, to support the required computing complexity. Embedded systems built using multiprocessor system-on-chip (MPSoC) technology are becoming popular [1, 2, 4, 6]. Such systems are complex because they contain multiple resources and execute multiple tasks, which interact with each other in intricate ways. They often require an operating system to manage the resources and tasks, which further increases the difficulty of tracking and understanding their internal behavior.

Designing MPSoC-based systems is even more complex. There are many design options to explore, and the design must factor in everything from the applications and the operating system to the hardware. For example, poor workload distribution among the multiple processors may cause excessive cache interference, which degrades system performance. An effective tool that profiles the target MPSoC system for exploring design options and identifying potential bottlenecks is thus in great demand.

An ideal profiling architecture for prototyping MPSoCs should have the following features. First, it must allow concurrent monitoring of all processors and system-wide events in the MPSoC. Second, it should be minimally intrusive to the target system, e.g., it should not require new instructions or an extra dedicated bus, so that the target system behaves similarly with and without profiling. Third, software support for controlling and accessing the collected data is necessary and should be carefully designed for flexibility and scalability.

In this paper, we propose a light-weight multiprocessor profiling architecture (MPPA) for prototyping embedded systems during the design phase. With this architecture, designers can easily collect low-level events of the target system without extensive modifications to the original architecture. MPPA is based on FPGA emulation, in which the target system is implemented and emulated using an FPGA [6]. On the same FPGA, MPPA lays out extra hardware to profile the behavior of the system. It consists of two parts: event sensing and event collecting. The event sensing part relies on hardware counters to collect statistics of interest. The event collecting part can be realized by various approaches, e.g., a special coprocessor interface or a dedicated bus with privileged instructions. We show how this can be done by leveraging the target architecture.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the proposed multiprocessor profiling architecture, MPPA. Section 4 introduces an implementation of MPPA based on the LEON3 cores. Section 5 shows the synthesis results of the system on an FPGA board and demonstrates the usability and feasibility of MPPA through a case study. Conclusions and future work are given in Section 6.

2. RELATED WORK

There has already been a large body of research on performance profiling for computing systems. For example, MAMon [7] focuses on hardware and software debugging of single- and multiprocessor SoCs. Low-level logic and system-level events are recorded with a timestamp inside the buffer memory of a probe unit and are later sent to the host computer for analysis and debugging. All the target event signals must be hard-wired from the target cores to the probe unit. MAMon requires an extra MMU, buffer memory, and a dedicated external interface to collect the profiling data.

To reduce design complexity, many profiling systems only count events instead of recording every detail of the events [16, 18, 21]. Event counts can readily convey a lot of useful information, such as cache misses or pipeline stalls during runtime, and can be implemented with simple circuits, i.e., registers. Most modern processors already have built-in performance counters to assist performance profiling and tuning. Our MPPA architecture therefore also adopts this approach.

With the performance counters built in, the next important issue is to collect the counter values and transfer them to the host. The ARM architecture supports a mechanism for extending the instruction set through the addition of coprocessors [18]. A performance monitoring coprocessor can thus be developed through this interface. There are two event counters in the coprocessor, and their states are controlled by the special instructions MCR and MRC.

The IBM PLB (Processor Local Bus) Performance Monitor (PPM) provides dedicated hardware for counting certain events associated with PLB bus transactions [16]. The DCR (Device Control Register) bus is used to transfer data between the CPU's general purpose registers (GPRs) and the control registers of DCR slave logic devices such as the PPM [15]. This profiling architecture is built on proprietary interfaces and thus is difficult to generalize. Also, special instructions are needed to access the counters.

Bhattacharjee et al. proposed an FPGA-based power emulator [21], implemented on Xilinx Virtex-II Pro 70 FPGAs. The emulator has component-specific (i.e., pipeline, register file, and caches) event counters to evaluate the power consumption of the system using appropriate power models. To access the counter values, extra instructions are introduced. With sampling-based profiling, it may also produce inaccurate profiles due to system interrupt overhead.

As we have discussed, there are different approaches to performance profiling. Most of them require additional instructions and a dedicated bus or interface to access the collected data. To apply these architectures, extensive modifications to the target design are needed. This is undesirable and may introduce extra uncertainty into the MPSoC design. Besides, such modifications incur heavy overhead for functional verification and testing.

3. ARCHITECTURE OVERVIEW

In this section, we first describe an ideal profiling architecture. Next, we propose our profiling architecture, MPPA.

3.1 Ideal profiling architecture

An MPSoC contains multiple processors connected by an interconnection bus or network. All the processors can access the shared memory. To profile such a system, the profiling architecture must have the following features.

Support of multiprocessor architecture. Most modern processor cores already have performance counters built in, or performance monitor units attached as coprocessors. These can only be accessed by the processor core itself. However, for an MPSoC we are more interested in the behavior of the whole system, including memory, bus, and I/O. Without special support, there is no way for a processor core to access other cores' performance counters. As a result, an ideal multiprocessor profiling architecture must sit outside the processor cores and monitor the system as a whole.

Minimal modification to the original design. If the target MPSoC system does not support profiling mechanisms, it takes a lot of effort to develop one for the system. Furthermore, if profiling is done only at the design and prototyping stage and will be removed in production, it is better not to modify the target system too much for profiling. As a result, we should develop an architecture that modifies the target system as little as possible, for example by avoiding new instructions for accessing the performance counters. This also calls for a simple and slim profiling architecture design.

Systematic architecture design. There are various ways to design the performance counting mechanism. Unfortunately, there does not seem to be a systematic way of deciding where to add counters and how to collect their values. For example, if we want to measure the data cache, we may add counters inside the data cache. But how can the processors access them? One idea is to map these performance counters to special system registers. The problem is that the system register address space is often limited and not scalable. More importantly, it requires special system instructions to access these counters under privileged mode. A systematic architecture design can reduce the hardware complexity, increase the software scalability, and also lower the overhead of accessing the counters.

3.2 Proposed profiling architecture: MPPA

Our proposed profiling architecture, MPPA, is shown in Figure 1. MPPA can be divided into two components: event sensing and event collecting. The event sensing part is responsible for detecting specific hardware events and notifying the event collecting part of their occurrences. The event collecting part is responsible for accumulating event counts from the event sensing part.

Figure 1. Basic idea of the proposed profiling architecture.

In order to monitor and analyze low-level events, we embed multiple event sensors in the components and the interconnect. These sensors form the event sensing part. All event sensors are connected to the event collecting part, i.e., the monitor module. When a monitored event occurs, the sensors pass a signal to the monitor module. In order to control and access the monitor module, it is mapped to a segment of the memory address space for memory-mapped addressing.

Figure 2 shows the profiling flow of MPPA. The monitor module is responsible for collecting event signals and acts as a controller to communicate with the processors. Processors can enable and disable profiling through the start (enable-counting-events) and stop (disable-counting-events) kernel routines, as steps 1 and 3 show. When profiling is started, the monitor module starts counting event occurrences, as step 2 shows. At the end of profiling, the monitor module sends the performance statistics back through the system interconnection bus, as step 4 shows. A clear command (reset-counter-values) is provided to reset the counters.

Figure 2. Proposed MPPA and mechanisms.
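To make the flow concrete, the following is a minimal sketch of what the start, stop, and clear routines could reduce to once the monitor module's control register is memory-mapped. The base address, register offset, and bit positions are illustrative assumptions, not the actual MPPA register layout:

```c
#include <stdint.h>

/* Hypothetical MPPA control register (illustrative; the real address
 * and bit layout are defined by the monitor module implementation). */
#define MPPA_CTRL_ADDR   0x80000F00u            /* assumed I/O address     */
#define CTRL_ENABLE      (1u << 0)              /* enable-counting-events  */
#define CTRL_CLEAR       (1u << 1)              /* reset-counter-values    */

static volatile uint32_t *const mppa_ctrl =
    (volatile uint32_t *)MPPA_CTRL_ADDR;

static void mppa_start(void) { *mppa_ctrl |= CTRL_ENABLE; }   /* step 1 */
static void mppa_stop(void)  { *mppa_ctrl &= ~CTRL_ENABLE; }  /* step 3 */
static void mppa_clear(void) { *mppa_ctrl |= CTRL_CLEAR;  }   /* clear  */
```

Because the module is just another memory-mapped slave, these routines are plain loads and stores; no new instructions or dedicated buses are involved.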

3.3 MPPA design flow

In order to apply MPPA, we propose a design flow to guide users through the design phase, as Figure 3 shows. In the beginning, users decide which events they want to monitor; common choices would be the cache miss ratio for a single processor and cache coherence misses for a multiprocessor system. After event selection, users need to understand the detailed hardware architecture of the component to be studied by reading the hardware specifications. Next is the implementation phase, in which users first study the source code and locate the target component in the code. Second, users find the point that triggers the target event and embed an event detector in that component for event notification. This process is iterated until all the event detectors are added. The last two steps build the connections between the event detectors and the monitor module, set up the corresponding counters inside the monitor module, and finally map these counters to a segment of the memory address space for access.

Figure 3. MPPA design flow.

There are no architecture-dependent steps in this design flow, so developers can follow it to apply their profiling frameworks to any architecture.

4. IMPLEMENTATION ON LEON3

In this section, we illustrate an implementation of the proposed profiling architecture on the LEON3 platform. We first introduce the LEON3 platform. In Sections 4.2 and 4.3, we present the hardware and software design and implementation, respectively, of our proposed multiprocessor profiling architecture on the LEON3 platform.

4.1 LEON3 architecture

The LEON3 processor is a 32-bit synthesizable soft processor core based on the SPARC V8 architecture, with a seven-stage pipeline and multiprocessor support [11, 13, 17]. It is distributed as part of the open source GRLIB IP library from Gaisler Research, implemented as a synthesizable VHDL model. The processor design is highly configurable and particularly suitable for SoC designs. Multiple LEON3 processors can be configured to form a symmetric multiprocessing architecture through cache snooping units. The GRLIB IP library is a bus-centric design, in which all the IP cores are interconnected through the on-chip bus. The AMBA 2.0 AHB/APB can be used as the on-chip bus, and all the LEON3 cores have a compliant AHB/APB bus interface to communicate with other components [10].

Gaisler provides various software tools and operating systems for the LEON3 system [12]. Linux support for LEON3 is provided through a special version of the SnapGear Embedded Linux distribution [14]. SnapGear Linux supports both MMU and non-MMU LEON configurations, as well as symmetric multiprocessing (SMP). A single cross-compilation toolchain is provided, capable of compiling the kernel and applications for any configuration.

4.2 MPPA hardware implementation

Figure 4 shows the MPPA profiling architecture on the LEON3 platform. Two processors and one monitor module are attached to the AHB system bus. Each processor has a connection to the monitor module for passing the monitoring signals. The monitor module plays the role of collecting the performance event statistics, as shown in the top half of the figure. The circles inside each processor represent event sensors, which monitor the target components. Once a monitored event occurs, the event sensors notify the monitor module by pulsing signals.

Figure 4. Integration of MPPA and the LEON3 architecture.

After the measurement, processors can read out the event statistics from the monitor module through the system bus. Since the performance counters are memory-mapped, processors can access these counters like ordinary memory locations. This method simplifies both the hardware and the software design, avoiding dedicated interfaces and extended instructions. Moreover, since the monitor module is attached to the AHB bus, it can intercept all bus transactions and therefore collect bus performance metrics as well.

Inside the monitor module, there are multiple blocks representing the performance counters used for counting the monitored events. All these counters can be accessed through the mapped memory addresses and the AHB bus interface. Figure 5 shows the high-level hardware block diagram of the monitor module, which can be roughly divided into two parts: the monitor core and the AHB interface.

Figure 5. Monitor module block diagram.

The monitor core is responsible for receiving event signals, manipulating the corresponding counter values, and handling read accesses from processors through the AHB interface. It can be further divided into three parts: the monitor controller (MC), the counter value manager (CVM), and the event counters (ECs). The monitor controller controls the whole monitor module, with the ability to enable or disable the CVM and clear the ECs' values. If the CVM is disabled by the MC, it does not monitor the input event signals and thus does not update the EC values. If the CVM is enabled, it updates the EC values according to the input event signals: once an event occurs, the CVM retrieves and updates the corresponding counter value. The event counters are composed of registers and are where the event occurrences are recorded. They are accessible through the AHB interface, which follows the AHB specification to allow memory-mapped accesses from the processors.
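From the software point of view, the monitor module can be pictured as a small block of memory-mapped registers. The following is a minimal sketch assuming one 32-bit control register followed by the twenty-three 32-bit event counters reported in Section 5; the exact ordering and field layout are assumptions for illustration:

```c
#include <stdint.h>

#define MPPA_NUM_COUNTERS 23   /* twenty-three 32-bit counters (see Section 5) */

/* Assumed register map of the monitor module as seen over the AHB bus. */
struct mppa_regs {
    volatile uint32_t ctrl;                       /* MC: enable/clear control bits */
    volatile uint32_t counter[MPPA_NUM_COUNTERS]; /* ECs: event occurrence counts  */
};

/* Read one event counter through an ordinary memory access; no special
 * instructions or dedicated bus are needed. The caller supplies a pointer
 * to wherever the module is mapped in the AHB address space. */
static inline uint32_t mppa_read_counter(struct mppa_regs *regs, unsigned idx)
{
    return regs->counter[idx];
}
```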

4.3 MPPA software implementation

Software support for our profiling architecture falls into two parts: the manipulation of the monitor module, i.e., controlling its status, and the counter value accesses.

The functionalities of the monitor controller are enable-sensing, disable-sensing, and clearing the counters, all controlled through the control register. Setting the corresponding bit of the control register triggers (enables) the functionality. From the software side, we therefore need a way to modify the bits of the control register to enable or disable event counting and to trigger the clear function.

In the following, we discuss how our profiling architecture works under the Linux OS. Under Linux, retrieving the counter values of the monitor module can be implemented in two ways: system calls or a device driver. Since the monitor module is mapped into the I/O memory space, accessing its counters requires the system to be in kernel mode, and system calls are one common way for user programs to access privileged I/O memory space. Using system calls to access the counter values, however, has several disadvantages. We would have to implement many system calls to retrieve the various counter values, which results in poor scalability and large measurement overhead. Moreover, executing a system call can itself incur many hardware events (more than 20,000 cache misses in our measurements, largely due to mode switches), making the profiling results inaccurate. Besides, extending the set of system calls requires recompiling the kernel.

The other way of accessing the counter values is a device driver, which is the most common solution for accessing I/O peripherals. Device drivers play a special role in the Linux kernel: they behave like a black box that makes a particular piece of hardware respond to a well-defined programming interface. This interface allows a driver to be built separately from the rest of the kernel and plugged in at run time when needed. Besides, accessing counter values through a device driver incurs only a small overhead (fewer than 5 cache misses in our measurements) and also scales better than the system call solution. We therefore use a device driver as the software solution in our design.

MPPA Linux device driver implementation. Linux device drivers can be categorized into three types: character, block, and network drivers [3]. In MPPA, the statistics are generated from 32-bit counters, so a character device driver is the most suitable for our monitor hardware. The basic operations of a character device driver are open, read, write, and close. Character devices are accessed by means of file-system nodes through these basic predefined operations, and Linux defines a dedicated data structure called file_operations for mapping the standardized operations to the specific driver functions.
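As an illustration, a character driver for the monitor module could wire the standard operations up as follows. This is a minimal sketch in the style of a Linux 2.6-era driver; the device name, the mapped I/O base address, and the function bodies are assumptions, not the actual MPPA driver:

```c
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/uaccess.h>

#define MPPA_IO_BASE 0x80000f00UL   /* assumed physical address of the monitor */
#define MPPA_IO_SIZE 0x100

static void __iomem *mppa_base;     /* kernel mapping of the monitor registers */

static int mppa_open(struct inode *inode, struct file *filp)
{
    return 0;                       /* nothing to set up per open in this sketch */
}

/* Copy one raw 32-bit counter out to user space per read() call. */
static ssize_t mppa_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    u32 val;

    if (count < sizeof(val) || *ppos + sizeof(val) > MPPA_IO_SIZE)
        return 0;
    val = readl(mppa_base + *ppos);             /* one memory-mapped counter */
    if (copy_to_user(buf, &val, sizeof(val)))
        return -EFAULT;
    *ppos += sizeof(val);
    return sizeof(val);
}

static int mppa_release(struct inode *inode, struct file *filp)
{
    return 0;
}

static const struct file_operations mppa_fops = {
    .owner   = THIS_MODULE,
    .open    = mppa_open,
    .read    = mppa_read,
    .release = mppa_release,
};

static int __init mppa_driver_init(void)
{
    int major;

    mppa_base = ioremap(MPPA_IO_BASE, MPPA_IO_SIZE);
    if (!mppa_base)
        return -ENOMEM;
    major = register_chrdev(0, "mppa", &mppa_fops);  /* dynamic major number */
    if (major < 0) {
        iounmap(mppa_base);
        return major;
    }
    return 0;
}
module_init(mppa_driver_init);
MODULE_LICENSE("GPL");
```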

When a user program starts to execute, it may open the device node through the open operation defined in the file_operations structure. After some initialization steps, it can communicate with the hardware device through read or write operations, and finally it closes the device node to end the I/O communication.

Since the I/O memory region is protected in kernel mode, the performance overhead of user-kernel mode switching is one of the most critical issues to be solved. In Linux, one technique that solves this problem is mmap (memory mapping). The basic idea of device memory mapping is to associate a range of user-space addresses with device memory. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. For a performance-critical application, this direct access makes a big difference.
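A user-space sketch of this technique, assuming the driver exposes the counter region through mmap on a device node (the node name /dev/mppa, the page-sized mapping, and the counter offset are illustrative assumptions):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mppa", O_RDWR);          /* assumed device node */
    if (fd < 0) { perror("open"); return 1; }

    /* Map the monitor module's register page into user space; afterwards
     * loads and stores in this range go straight to the device, with no
     * per-access mode switch. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    uint32_t dcache_misses = regs[1];   /* counter offsets are assumptions */
    printf("data cache misses: %u\n", (unsigned)dcache_misses);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```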

With the memory mapping to user space, the user program can obtain a pointer to the base address of our monitor hardware. Through offsets from this base, it can read the counter values or write to the control register to enable, disable, or clear the hardware block. We also define a PMU library for user programs to communicate with the device driver, hiding the low-level details from application developers. The pmu_init function performs two initializations: it opens the device node and performs the memory mapping that maps the hardware I/O space to user space for fast access. This is followed by the pmu_clear function, which zeroes all the counter values and enables performance event counting in the monitor module. After program execution, the pmu_msg function is invoked to stop monitoring and read the event statistics back. The pmu_end function performs the reverse of pmu_init to end profiling.
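A minimal sketch of this library, layered on the mmap technique of the previous paragraphs. The function names come from the text above; the device node name, register offsets, control bits, and omitted error handling are illustrative assumptions:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PMU_NODE "/dev/mppa"    /* assumed device node            */
#define PMU_NCTR 23             /* counters reported in Section 5 */

static int fd = -1;
static volatile uint32_t *regs; /* [0] = control, [1..23] = counters (assumed) */

void pmu_init(void)             /* open the node and map the I/O space */
{
    fd = open(PMU_NODE, O_RDWR);
    regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

void pmu_clear(void)            /* zero the counters, then start counting */
{
    regs[0] = 0x2;              /* assumed clear bit  */
    regs[0] = 0x1;              /* assumed enable bit */
}

void pmu_msg(void)              /* stop counting and read statistics back */
{
    regs[0] = 0x0;              /* disable counting */
    for (int i = 0; i < PMU_NCTR; i++)
        printf("counter %2d = %u\n", i, (unsigned)regs[1 + i]);
}

void pmu_end(void)              /* reverse of pmu_init */
{
    munmap((void *)regs, 4096);
    close(fd);
}
```

A profiled program would then simply bracket the measured region with pmu_init(); pmu_clear(); ... pmu_msg(); pmu_end();.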

Summary. LEON3 is an open source SoC design suitable for fast prototyping. Applying our MPPA to LEON3 requires only small modifications to the processors: embedding the event detectors, connecting the output event signals to the monitor module, and mapping the event counters to a memory segment. Under Linux, we implement a device driver to manipulate the monitor module and access the event counters. This not only keeps the performance measurement overhead small but also provides scalability and flexibility. With the integration of hardware architecture and software support, MPPA offers an efficient, compact, and minimally intrusive design for performance measurement without a dedicated bus or coprocessor interfaces.

5. EVALUATION AND APPLICATION

In this section, we introduce the prototyping platform for MPPA and its synthesis results, and then demonstrate the use of the platform through a case study. We set up the prototyping system with dual LEON3 processors, as Figure 6 shows. The LEON3 cores are realized on the Xilinx ML501 FPGA emulation board [20]. The FPGA synthesis results and the device utilization summary are shown in Table 1. The execution frequency of the target platform is about 80 MHz. The results show that our architecture with twenty-three 32-bit counters causes only a 0.66% increase in total gate count.

Figure 6. Dual LEON3 processor system.

Table 1. Xilinx Virtex-5 FPGA synthesis result of the target platform with the MPPA architecture (increases due to MPPA in parentheses).

Resource                            | Used                | Max available | %Used
Number of Slices used as Flip Flops | 14,313 (+929)       | 28,800        | 49% (+3%)
Number of Slice LUTs                | 27,963 (+2,659)     | 28,800        | 97% (+10%)
Total equivalent gate count         | 5,718,724 (+37,540) | -             | -

MPPA has very low complexity and is easy to apply to an open source platform. Without extra special buses or instructions, it provides compatibility and scalability while preserving the ability to support run-time feedback. Besides, MPPA also has the capability to monitor both processor and bus events in common shared-bus architecture designs.

5.1 Case study

In this section we demonstrate the use of MPPA with a case study. The example multi-threaded program processes a large integer array with two threads. We first develop a rudimentary version, case1, and use MPPA to identify its performance bottlenecks. A better version, case2, is then developed to correct the problems. In the first version, thread 0 is responsible for the even-numbered elements and thread 1 for the odd-numbered elements. Figure 7 shows the workload distribution of this first version.

Figure 7. Interlaced workload distribution for two processors.
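A minimal sketch of the two partitionings, assuming a shared global array and a POSIX-threads implementation; the array size and thread bodies are illustrative, as the original benchmark code is not shown in the paper:

```c
#include <pthread.h>

#define N (1 << 20)
static int data[N];

/* case1: interlaced partition. Thread 0 touches the even elements and
 * thread 1 the odd ones, so both cores keep writing into the same cache
 * lines and every update invalidates the line in the other core. */
static void *work_interlaced(void *arg)
{
    long tid = (long)arg;                  /* thread id: 0 or 1 */
    for (long i = tid; i < N; i += 2)
        data[i] += 1;
    return NULL;
}

/* case2: blocked partition. Each thread owns a contiguous half of the
 * array, so the two cores rarely share cache lines. */
static void *work_blocked(void *arg)
{
    long tid = (long)arg;
    long lo = tid * (N / 2), hi = lo + N / 2;
    for (long i = lo; i < hi; i++)
        data[i] += 1;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long tid = 0; tid < 2; tid++)
        pthread_create(&t[tid], NULL, work_interlaced, (void *)tid);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* Swap in work_blocked above to run the case2 partitioning. */
    return 0;
}
```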

Through the performance statistics collected by MPPA, we can observe the system behavior, e.g., the data cache miss rate and the cache coherence miss counts, as Table 2 shows. From the table we find that most cache misses come from coherency invalidations, i.e., coherency misses. Since the element size is smaller than the cache line size, each cache line update in core 1 causes the snoop control unit to invalidate the same cache line in core 2, and vice versa. Thus, this kind of workload partition results in poor cache performance on an MPSoC. Figure 8 shows an improved version: core 1 is responsible for the first half of the data array and core 2 for the second half.

Table 2. Performance statistics from the execution of case1.

Figure 8. Better workload distribution for two processors.

With this kind of partition, the cache coherency misses are greatly reduced, as Table 3 shows. After this modification of the memory access pattern, the program runs 1.228 times faster than the first version.

Table 3. Performance statistics from the execution of case2.

We have also verified this result by computing the ideal performance manually. First, we convert the C source code of case2 to assembly code and extract the most time-consuming part. From the LEON3 processor specification, we obtain the ideal CPI (cycles per instruction) of the SPARC instructions: except for the STORE instruction (ST), which costs 4 cycles, every instruction used in our program is counted as one cycle, assuming no cache misses. The ideal cycle count is then the sum, over the instruction types, of the instruction count times the corresponding CPI. Comparing the manual calculation (340,000,000 cycles) with the experimental result (362,412,833 cycles), we find that the experimental result is very close to the ideal performance. Through this case study, we can see that the proposed MPPA is a useful tool that helps programmers identify and reason about performance problems. Programmers can analyze software and hardware behaviors more closely and may identify application bottlenecks more easily.
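Stated as a formula (a restatement of the calculation just described, where $N_i$ denotes the dynamic count of instruction type $i$ and $\mathrm{CPI}_i$ its ideal cycles per instruction):

$$\text{Cycles}_{\text{ideal}} = \sum_i N_i \cdot \mathrm{CPI}_i = 4\,N_{\mathrm{ST}} + \sum_{i \neq \mathrm{ST}} N_i \approx 3.4 \times 10^{8}$$

The measured 362,412,833 cycles lie about 6.6% above this ideal, a gap consistent with the cache misses and other stalls that the ideal CPI model excludes.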

6. CONCLUSIONS

In this paper, we propose a general multiprocessor profiling architecture (MPPA) for prototyping MPSoC-based embedded systems during the design phase. MPPA embeds event detectors inside the processors and system components for event sensing and uses an external monitor module for controlling and collecting the event information. To control and access the monitor module, we use a Linux device driver that maps the profiling registers into user space for fast access. We implement MPPA using the LEON3 cores in synthesizable VHDL and emulate the whole system on the Xilinx ML501 FPGA platform. The FPGA synthesis results show that our architecture with twenty-three 32-bit counters causes only a 0.66% increase in total gate count.

In this paper, we treat all event counters equally, which may make it harder to find system bottlenecks in more complex architectures. It is therefore worth addressing in future work how to use these event values for performance analysis. Additionally, we plan to enhance MPPA with interrupt capability to support time- or event-based sampling, to investigate automatic integration of MPPA into target systems, and to address the issue of counter overflow and handle it in a proper way.

7. Acknowledgements

This work is funded by the Industrial Technology Research Institute and National Science Council grant NSC 97-2220-E-007-038.


8. REFERENCES

[1] Luca Benini, David Bertozzi, Alessandro Bogliolo, Francesco Menichelli, Mauro Olivieri, "MPARM: Exploring the Multi-processor SoC Design Space with SystemC", Journal of VLSI Signal Processing, Volume: 41, Page(s): 169-182, 2005.

[2] Wander O. Cesario, Amer Baghdadi, Lovic Gauthier, Damien Lyonnard, Gabriela Nicolescu, Yanick Paviot, Sungjoo Yoo, Ahmed A. Jerraya, Mario Diaz-Nova, "Multiprocessor SoC Platforms: A Component-Based Design Approach", IEEE Design and Test of Computers, Nov-Dec 2002.

[3] Jonathan Corbet, Greg Kroah-Hartman, and Alessandro Rubini, Linux Device Drivers, 3rd ed., Sebastopol: O'Reilly, 2005.

[4] Ahmed Jerraya and Wayne Wolf (Eds.), Multiprocessor Systems-on-Chip, San Francisco, California: Elsevier Morgan Kaufmann, 2005.

[5] Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, Bengt Werner, "Simics: A Full System Simulation Platform", IEEE Computer, Volume: 35, Issue: 2, Feb 2002.

[6] Erno Salminen, Ari Kulmala, and Timo D. Hämäläinen, "HIBI-based Multiprocessor SoC on FPGA", Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2005), Volume: 4, Page(s): 3351-3354, May 2005.

[7] Mohammed El Shobaki and Lennart Lindh, "A Hardware and Software Monitor for High-Level System-on-Chip Verification", Proc. IEEE International Symposium on Quality Electronic Design, San Jose, USA, March 2001.

[8] Brinkley Sprunt, "Pentium 4 Performance-Monitoring Features", IEEE Micro, Volume: 22, Issue: 4, Page(s): 72-82, July-Aug 2002.

[9] Brinkley Sprunt, "The Basics of Performance-Monitoring Hardware", IEEE Micro, Volume: 22, Issue: 4, Page(s): 64-71, July-Aug 2002.

[10] ARM Limited, AMBA Specification 2.0, http://www.arm.com

[11] Jiri Gaisler, GRLIB IP Cores Manual, http://www.gaisler.com

[12] Jiri Gaisler, GRMON User's Manual, http://www.gaisler.com

[13] Jiri Gaisler, The LEON3 Processor User's Manual, http://www.gaisler.com

[14] Daniel Hellström, SnapGear for LEON Manual, http://www.gaisler.com

[15] IBM, Device Control Register Bus 3.5 Architecture Specifications, http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/2F9323ECBC8CFEE0872570F4005C5739

[16] IBM, PLB Performance Monitor User's Manual, http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/904F6514C02DC4D487256B9E0054315D

[17] SPARC International, Inc., The SPARC Architecture Manual, Version 8, http://www.sparc.org/standards/v8.pdf

[18] Intel, 3rd Generation Intel XScale Microarchitecture Developer's Manual, http://developer.intel.com/design/intelxscale/316283.htm

[19] Hassan Shojania, Hardware-based performance monitoring with VTune Performance Analyzer under Linux, http://hassan.shojania.com

[20] Xilinx, ML501 Evaluation Platform User Guide, http://www.xilinx.com/products/boards/ml501/docs.htm

[21] Abhishek Bhattacharjee, Gilberto Contreras, and Margaret Martonosi, "Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation", Proc. International Symposium on Low Power Electronics and Design, Page(s): 335-340, August 2008.