Embedded Multiprocessor Systems-on-Chip Programming

Jean-Yves Mignolet and Roel Wuyts, IMEC Belgium

A programming approach supported by a tool chain offers solutions for mapping demanding applications (such as multimedia and wireless) onto multiprocessor system-on-chip platforms.

The first industrial multiprocessor system-on-chip (MPSoC) platforms have found their way into the embedded systems we use in our everyday life: the IBM Cell processor [1] has been deployed in game consoles, and ARM (with the ARM11 MPCore [2]) and Texas Instruments (with the TI C6474 [3]) are delivering multicore products. For cost reasons, the use of MPSoC platforms will likely extend to other embedded devices such as cell phones and portable media players. These programmable architectures provide the flexibility and computing power

to tackle present and future applications. But an important question remains: how can embedded designers efficiently and productively exploit this computing power? We can reformulate this question as, what programming model should the programmer use to efficiently and productively program these MPSoC architectures?

In a broad sense, a programming model is a set of software technologies and abstractions that lets the designer express a program or algorithm in a way that matches the target architecture. These software technologies exist at different abstraction levels and encompass programming languages, libraries, compilers, and runtime mapping components.

However, a designer can't exploit the full potential and scalability of an MPSoC platform by describing a program as a single sequential task. To exploit this potential, the designer must use a parallel-programming model. This means that to map the required behavior onto the available resources, the designer must solve four issues:


■ Dividing a program consisting of a single task into one with multiple balanced and communicating tasks. This process must take the target platform's properties into account, and it is error-prone and time-consuming.
■ Managing intertask communication and synchronization. The designer must handle this carefully to avoid serializing the parallel program and to prevent race conditions—that is, problems with data coherency across threads (see the sketch after this list). Again, this is error-prone and time-consuming, and it introduces the risk of deadlocks.
■ Handling contention for on-chip resources (for example, shared memories and the shared interconnect fabric) to minimize performance penalties and mitigate predictability and composability issues.
■ Debugging a program on a multiprocessor platform, which is much more challenging than debugging on a single-processor platform.
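To make the second issue concrete, here is a minimal C sketch (our illustration, not from the article) of the kind of synchronization the designer must get right by hand. Two threads increment a shared counter; without the mutex, the read-modify-write sequences interleave and updates are silently lost—a classic race condition.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* data shared between threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        /* Without the lock, counter++ is a read-modify-write sequence;
           two threads can interleave and silently lose updates. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);            /* 2000000 only thanks to the lock */
    return 0;
}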

Embedded software developers can implement parallel-programming models in three ways: by creating a new parallel-programming language, by extending an existing language, or by invoking libraries from a traditional sequential language. In turn, compilers, hardware abstraction libraries, or runtime management components can match the program description with the underlying architecture.

Dealing with all these issues dramatically increases the design time for even a single mapping. In practice, designers must make and compare multiple alternative mappings during the exploration phase. But owing to the cost and time of making even one iteration, designers are often limited to trying only a few possible mappings during exploration, which leads to suboptimal results.

Over the course of many years, IMEC researchers have been working on making development for MPSoC systems easier. This effort has resulted in a toolflow that maps applications onto MPSoC platforms, generating correct-by-construction parallel code starting from the original sequential application C code.

Contemporary MPSoC Programming Solutions

There are four common ways of dealing with the complexity of MPSoC development.

The first and simplest way uses the general-purpose processor as the main processing element and the other (heterogeneous) processors as accelerators. This solution lets the designer create sequential code while still exploiting the MPSoC platform resources, which, in turn, results in low design effort. However, this approach isn't scalable because it potentially relies on a centralized (runtime) dispatching resource. If accelerator functions are statically assigned, resource usage flexibility is lost. Furthermore, this approach can't exploit the full potential of MPSoC platforms because it deals only with acceleration and not with parallel execution; that is, the main program stalls until the accelerated function returns. On the plus side, without actual parallel execution, there's no data coherency issue.

A second option is to design or buy a commercial multiprocessor real-time operating system (MP-RTOS)—for example, the Quadros RTXC/MP-RTOS—and use the primitives it provides for communication, threading, and synchronization. From an embedded-designer viewpoint, this involves dividing the sequential code into threads by partly developing code from scratch and partly reusing the available sequential code. In addition, the designer must discover and

manually insert the interthread synchronization. This is, of course, time-consuming and error-prone. Furthermore, such an MP-RTOS represents only an abstraction level on top of the MPSoC platform hardware. An MP-RTOS typically can't solve all the problems related to scalability, flexibility, and resource contention, so they remain the designer's responsibility. The designer must also assume that the hardware solves any data coherency problems.

A third option is to create parallel threaded code with assistance from tools such as OpenMP [4] (see the sketch at the end of this section). Obviously this requires less manual code writing, but it remains error-prone. The designer still must handle critical issues such as intertask synchronization and intertask communication. Coherency problems still exist, and the designer must assume they're solved in hardware, which is (again) a costly and nonscalable solution. Contention for shared resources remains a problem, as do resource assignment and scheduling. The latter are typically assumed to be solved by the RTOS, but there are no real guarantees with respect to mapping predictability.

Finally, the designer can manually create parallel threaded code that uses explicit data communication with message passing (MPI [5]) or relies on the Portable Operating System Interface (POSIX) library [6]. This means that the parallelization is the designer's responsibility. It remains time-consuming and error-prone because the designer must add explicit data communication (while taking the platform properties into account). Data coherency no longer poses a problem, but there's no longer a shared-memory concept. Furthermore, this approach typically uses more (expensive) memory resources.

Regardless of the technique chosen to deal with MPSoC development complexities, the end result is a mapping of behavior onto processing elements. To verify that the mapping is functionally correct and performs as desired, the designer must execute a wide range of simulation experiments. If the desired performance isn't achieved, the designer must repeat the mapping process (different parallelization, different synchronization). This adds a significant penalty to the mapping design time. Furthermore, guaranteeing that a combination of application components will work is an even harder task, because the designer must run the same wide range of experiments for many application component combinations.

The bottom line is that all these contemporary MPSoC programming approaches provide relief with respect to the programming and application component mapping burden but fail to handle fundamental issues. Our MPSoC programming model addresses these issues while minimizing the impact on the embedded designer.
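To make the third option concrete, the fragment below shows generic OpenMP use in C (our illustration; the article itself contains no code). The pragma divides the loop iterations over threads and synchronizes the shared accumulator, but everything the clauses don't cover—shared buffers, cross-iteration dependencies, resource contention—remains the designer's responsibility, as argued above.

double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    /* OpenMP partitions the iterations over threads; the reduction
       clause generates the synchronization for the shared accumulator. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}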



[Figure 1. IMEC's multiprocessor system-on-chip (MPSoC) flow for mapping sequential ANSI C code onto an MPSoC platform. The different tools gradually transition the original sequential C code to parallel code that executes on the runtime library API. The flow: sequential ANSI C application code passes through cleaning tools (with profiling) to produce clean C code and an application profile; the MPSoC mapping tools, guided by a platform description (memory hierarchy) and a parallelization specification, produce parallel threaded C code; one compiler per processor type produces the executables, which run on the MPSoC through the RTLib API of the runtime library.]


The IMEC MPSoC Programming Approach

To tackle the MPSoC programming issues and relieve the embedded designer as much as possible, our team focused on providing tools that help map sequential C code efficiently and predictably onto a multiprocessor platform. These mapping tools

■ generate parallelizations based on user specifications and add synchronization and intertask communication;
■ explicitly control data communication to avoid the hardware cache coherency problem and scaling bottleneck and to avoid resource contention, which involves actively managing the memory hierarchy communication;
■ make the code transition to the parallel-programming model and ensure the transition is functionally correct and optimized for a given set of platform parameters; and
■ can provide feedback to the designer on a sequential level.


Our approach incurs some software and hardware costs. On the software side, C provides much expressiveness for specifying what a program must do, but as a result some of its constructs are hard for tools to analyze (for example, moving pointers around arithmetically). Although our tools accept ANSI C code as input, their results will be less optimal (but still correct) with such hard-to-analyze constructs. This holds not only for our mapping tools but for all C analysis tools in general. So, our team developed CleanC (we discuss this in more detail later), a set of guidelines that describe how to write code so that C analysis tools can achieve maximum accuracy [7].

The hardware cost is incurred because the MPSoC platform needs to provide services that let a mapping tool reason about parallelization performance, intertask communication, resource assignment, resource scheduling, and application component performance. These services include, for example, a bounded latency on a memory-to-memory data movement transaction—which the mapping tool requires to efficiently schedule data transfers in the memory hierarchy. This bounded latency can be implemented on the platform through communication bandwidth reservation on the interconnect. Without this service, the code would still be correct but far less optimal. Another example is a service that enables a software-controlled memory hierarchy. The runtime library requires this service to implement the APIs related to data block transfers between the different memories. It can be implemented with a software-programmable direct memory access (DMA) block. In short, the more services are available, the more efficiently the runtime library API implementation will run.

Figure 1 shows our MPSoC flow for mapping sequential ANSI C code onto a multiprocessor platform. The MPSoC interactive cleaning tool derives CleanC code by removing as many undesired language constructs as possible. Then the designer uses the MPSoC mapping tools to map the code onto the multiprocessor platform. These mapping tools require CleanC code, high-level platform properties, and designer parallelization hints. The sequential C code can contain calls to accelerator functionality.

The MPSoC mapping tools parallelize the sequential application component. This includes automatically adding intertask communication and synchronization. Furthermore, these tools explicitly manage and exploit the scratch-pad memory hierarchy in a component-specific way. The generated parallel code is correct by construction. The tools map the application component onto a parallel-programming API provided by the runtime library (RTLib). This library is implemented on top of the platform hardware services (for example, DMA and communication channels with guaranteed bandwidth or latency) and is therefore provided by the platform designer together with the MPSoC platform itself. The embedded software designer can, if desired, program the platform directly by using the RTLib API but shouldn't be concerned with the underlying implementation. Finally, the back-end compiler (or compilers, if the MPSoC is heterogeneous) transforms the threads' source code into platform executables.

MPSoC Cleaning Tools

From the viewpoint of mapping-tool capabilities, we can divide the rules of CleanC into two categories: code restrictions and code guidelines. Code restrictions are constructs that aren't allowed in the input C code; otherwise, the tools will refuse to work. There are three restrictions: source files should be distinguished from header files, varargs are forbidden, and the results of calls to malloc and its brethren should be cast to the correct type. Code guidelines describe how to write code to give the mapping tools maximum accuracy and transformation freedom. There are 24 guidelines, such as protect header files against recursive inclusion, use preprocessor macros only for constants and conditional exclusion, and use multidimensional indexing for arrays. These guidelines are described with detailed motivations, examples, and solutions.

The IMEC team developed a set of interactive tools to analyze existing C code and help rewrite it to adhere to CleanC. Detailed analysis points out violations of the guidelines and restrictions and gives detailed error messages to remedy the problem. Search functionality is also built in—for example, to find whether a particular pointer variable is assigned more than once. The tools are fully integrated in the Eclipse development environment and exploit all the features this environment offers. The tools can help both develop new pieces of code and analyze existing bodies of C code. We stress that code needn't entirely adhere to CleanC before the actual mapping tools can be used. But the mapping tools will produce better results the more closely the code follows the CleanC guidelines.

We believe these guidelines have a larger scope than our own tools and are beneficial for embedded system programming in general, and for multiprocessor systems specifically. We've therefore made the analysis tools for CleanC available (www.imec.be/CleanC). IMEC also joined the Multicore Association (www.multicore-association.org), where we contribute our experience on multicore programming practices.
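As a small illustration of the guidelines' intent (our own example; the complete guideline set is at www.imec.be/CleanC), compare a pointer-walking loop, whose access pattern is opaque to analysis tools, with the multidimensional indexing that CleanC recommends:

/* Hard to analyze: the tool cannot easily tell which elements the
   arithmetically moving pointer touches. */
void scale_opaque(int *img, int w, int h, int f)
{
    int *p = img;
    for (int i = 0; i < w * h; i++)
        *p++ *= f;
}

/* CleanC style: explicit multidimensional indexing exposes the access
   pattern, so mapping tools can reason about reuse and data copies. */
void scale_clean(int w, int h, int img[h][w], int f)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            img[y][x] *= f;
}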

MPSoC Mapping Tools

The MPSoC mapping tools help with two common tasks when developing software for MPSoC systems. The memory hierarchy (MH) tool helps exploit the memory hierarchy of embedded systems by transforming code and mapping data onto particular memory regions to minimize the application's overall power consumption. The second tool, the MPSoC Parallelizing Assistant (MPA), helps with parallelizing applications.


Memory Hierarchy Management

To avoid a power and performance bottleneck in main-memory accesses, it is often useful to give threads local copies of data that they can access in local memories. Obviously, the energy and bandwidth consumed in making copies and transferring blocks of data between different memories must be balanced against the gain of avoiding repeated main-memory accesses. To let the developer find a good balance, we developed the MH tool to automate the code rewriting involved in inserting data copies and data block transfers. This tool lets the developer explore different options without having to rewrite C code manually. In other words, this tool enables our flow to scale to larger platforms and applications.

The tool uses two inputs: a description of the platform's high-level properties (which memory partitions exist, their properties, and how they're connected) and a profile of the application (used to optimally schedule—through prefetching—the data block transfers). The tool performs this profiling on the sequential code using, for example, an instruction set simulator. It concerns only the application's kernels.

We can distinguish two phases in the MH tool. The first is data reuse analysis. In this phase, the tool analyzes the code for potential reuse opportunities; that is, it searches for places where it's possible to introduce temporary copies of parts of the original data structures. Because there are usually one or more copy candidates for each access at every level in the syntax tree, the total number of these copy candidates can become large.
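The kind of rewrite the MH tool automates looks roughly like the following sketch (our illustration; dma_copy is a hypothetical platform service of the sort discussed earlier, not an actual RTLib call). A tile of the input is copied into a scratch-pad buffer so that the inner loop hits only local memory:

#define TILE 64   /* tile size; n is assumed to be a multiple of TILE */

/* Hypothetical block-transfer service backed by the platform's DMA. */
extern void dma_copy(void *dst, const void *src, unsigned bytes);

void filter_row(const int *row, int *out, int n)
{
    int local[TILE];                    /* copy candidate in the scratch pad */
    for (int t = 0; t < n; t += TILE) {
        dma_copy(local, &row[t], sizeof local);   /* one block transfer */
        for (int i = 0; i < TILE; i++)  /* all further accesses stay local */
            out[t + i] = (local[i] * 3) >> 1;
    }
}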


[Figure 2. (a) Execution length breakdown (active, stall, and halt cycles) for 30 frames (1 second) of the MPEG-4 SP encoder and (b) overall power consumption, broken down over the ADRES processor, level 1 memory, level 2 memory, and communications assist (CA), each for a 16-Kbyte scratch-pad memory (SPM), a 16-Kbyte data cache, and a 64-Kbyte data cache. The left graphs are for Quarter Common Intermediate Format (QCIF) resolution; the right graphs are for Common Intermediate Format (CIF) resolution. ADRES: Architecture for Dynamically Reconfigurable Embedded System.]

In the second phase, the MH tool takes two cost factors into account for optimization: the cycle cost and the energy due to memory accesses. Because this is a multiple-objective optimization problem, several "optimal" solutions might exist; that is, there might be solutions that are optimal in cycles but that require more energy, or vice versa. We can connect these solutions with a Pareto curve: for each point on the curve, no solution exists that's better in terms of both energy cost and cycle cost. The MH tool uses a search heuristic to perform this optimization. It also introduces prefetching—that is, starting the copies as soon as possible so that they proceed in parallel with the computation. More information on the MH tool and related research (for instance, that of Sumesh Udayakumaran and his colleagues [8]) is reported by Rogier Baert and his colleagues [9].

To prove the MH tool's usability and benefits, the IMEC team compared an application mapping on a scratch-pad-based platform using the MH tool to a mapping on a cache-based platform [9].
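The Pareto criterion itself is simple to state in code. The sketch below (ours, for illustration only; the MH tool's search heuristic is more involved) marks a candidate mapping as Pareto-optimal when no other candidate is at least as good in both objectives and strictly better in one:

typedef struct { double cycles; double energy; } mapping_cost;

/* Returns 1 if candidate i lies on the Pareto curve of the n candidates,
   0 if some other candidate dominates it. O(n^2) overall, for clarity. */
int is_pareto_optimal(const mapping_cost *m, int n, int i)
{
    for (int j = 0; j < n; j++) {
        if (j == i)
            continue;
        if (m[j].cycles <= m[i].cycles && m[j].energy <= m[i].energy &&
            (m[j].cycles < m[i].cycles || m[j].energy < m[i].energy))
            return 0;                   /* m[j] dominates m[i] */
    }
    return 1;
}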


The application is an MPEG-4 SP encoder, an industry-standard hybrid video encoder. The platform consists of a processor (a multimedia instance of ADRES, the Architecture for Dynamically Reconfigurable Embedded System [10]) with its local memory (cache or scratch pad), a bus, and main memory.

Figure 2a shows the number of cycles required to perform real-time (30 frames per second) MPEG-4 encoding for the Quarter Common Intermediate Format (QCIF) and Common Intermediate Format (CIF) resolutions. We report the low resolutions because the team performed the experiments on a cycle-accurate simulator that's too slow to simulate a significant number of frames (300 in our case) at larger resolutions. The figure shows that the scratch-pad-based platform is significantly faster than the cache-based one. The scratch-pad memory size is 16 Kbytes. Even with a cache larger than 16 Kbytes, a cache-based system requires a higher CPU clock frequency to meet the real-time constraint: 68 percent higher for a 16-Kbyte cache and still 27 percent higher for a 64-Kbyte cache. Power consumption rates also favor the scratch-pad-based system (see Figure 2b). The gain in overall system power is 30 percent and 19 percent compared to the 16-Kbyte and 64-Kbyte cache platforms, respectively.

Exploration of Parallel Mapping of an Application

The general idea of parallelization with the MPA tool developed at IMEC is that the designer identifies heavily executed parts of the sequential code that should be executed by multiple threads in parallel to improve the application component's performance. We denote pieces of code selected for parallelization as parsections. For each parsection, the designer specifies the number of threads that execute it. The designer can divide the work over threads in terms of functionality (that is, functional parallelism), in terms of loop iterations (that is, data parallelism), or in a flexible combination of both, depending on what's most appropriate for a given parsection. These parallelization directives are written in a file provided to the mapping tool (for an example of such a file, see related work [11]). Using directives in a separate file instead of inserting pragmas in the input code simplifies exploration (and retention) of multiple parallelization specifications for the same sequential code.

Given the input code and the parallelization directives, the tool generates a parallel version of the

code and inserts first-in, first-out queues (FIFOs) and synchronization mechanisms where needed. The designer doesn't need to worry about dependencies between threads; the tool takes care of this automatically and will thus always generate correct-by-construction parallel code.

However, because designers like to be in control, MPA provides the optional mechanism of shared variables. By specifying a variable as shared, the designer explicitly tells the tool that it doesn't have to add synchronization and communication mechanisms for that variable because he or she will take care of that. The designer can do this in several ways. One way is to add dummy dependencies (that is, an extra variable that isn't required for computation but that will cross the thread's boundary), which will be converted into scalar FIFOs (the extra dummy variable is chosen as a scalar, not an array). Another way is to specify a loop sync—that is, a structure that synchronizes certain iterations of a loop with other iterations of loops in other threads. Sometimes there's no need to add additional synchronization mechanisms at all, if the dependencies are already guaranteed by existing ones. The tool checks as much as possible for inconsistencies in the synchronization of the shared variables; however, it can't guarantee correct-by-construction parallel code when the designer uses these shared variables. Baert's most recent work gives more details on the tool's different steps (analysis, transformation, and communication/synchronization) [11]. MPA isn't a parallelizing compiler: it won't generate the best parallelization from arbitrary sequential code. But it does generate parallel code based on user directives.

To demonstrate the MPA tool's capabilities, the IMEC team explored different parallelization options for an MPEG-4 SP encoder on an embedded platform based on an ARM core [11]. To traverse the design space, we first made an initial ParSpec (parallelization specification), based on the required performance and the top-level profile report. Considering the execution profile (see Figure 3a), motion estimation (ME) clearly dominates the overall performance; it consumes almost twice the number of cycles of all the other kernels combined. So, if ME is split off and executed in parallel with the rest of the application, we can expect a speedup of about 1.4 (see Figure 3b, column B): with ME taking roughly two-thirds of the cycles, pipelining it against the remaining third bounds the ideal speedup at about 1.5, and overheads reduce this in practice.

The next step is to apply a data split. The designer can easily apply this to ME, but he or she must consider that the efficiency of a data split is often not 100 percent: preamble and postamble

for pipelined execution, load imbalance, and synchronization overhead decrease the efficiency (see Figure 3b, column C). To start, the team explored a moderate data split into two, which already yields an extra speedup (see the three-thread point in Figure 3c). Further refinement shows that combining the ME and motion compensation (MC) functionalities often leads to better results because these functionalities share much data, such as the video frames, that would otherwise have to be communicated (see Figure 3b, column D). Figure 3c shows all Pareto-optimal parallelizations the exploration found. It shows that the designer can generate a wide range of solutions, in terms of both performance (for example, execution time) and cost (for example, the number of processors used).
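The shape of the code such a functional split produces looks roughly like the following sketch (our illustration; fifo_push and fifo_pop are hypothetical stand-ins for the FIFO primitives the tool inserts, not actual MPA or RTLib names). The blocking pop is what provides the inter-thread synchronization:

/* Hypothetical blocking FIFO primitives; fifo_pop waits until data
   is available, synchronizing the two threads. */
extern void fifo_push(int id, const void *data, unsigned bytes);
extern void fifo_pop(int id, void *data, unsigned bytes);

#define MV_FIFO 0

/* Thread 1: motion estimation produces one motion vector per macroblock. */
void me_thread(int num_mb)
{
    for (int mb = 0; mb < num_mb; mb++) {
        int mv[2];
        /* ... motion search for macroblock mb, writing mv ... */
        fifo_push(MV_FIFO, mv, sizeof mv);
    }
}

/* Thread 2: the rest of the encoder consumes the results in order. */
void rest_thread(int num_mb)
{
    for (int mb = 0; mb < num_mb; mb++) {
        int mv[2];
        fifo_pop(MV_FIFO, mv, sizeof mv);
        /* ... motion compensation, texture coding, entropy coding ... */
    }
}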


Runtime Library Functionality and Specification

To ensure that developers can efficiently program MPSoC platforms using a design flow such as ours, we need a runtime library that abstracts the hardware. By means of different platform-dependent implementations, the runtime library offers a platform-independent interface to the hardware resources.

A well-designed runtime library interface carefully balances abstraction and performance. A more abstract interface hides more of the platform, which improves application portability. On the other hand, for performance, the interface should expose platform features the developer can use to optimize the application for the platform. In our context, where we start from sequential code and make all decisions at design time, the following runtime functionality is required: an application-starting API (to start and stop the threads), an application-configuration API (to define the number of threads and FIFOs and how they relate to each other), and a communication API (to transfer data between the threads).

We have performed several implementations of the RTLib. Starting from a generic reference implementation, porting to the Pthreads library [6] took less than one person-month. The team also ported the RTLib to a proprietary platform based on ADRES processors and to the ARM11 MPCore platform.
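An interface covering those three API groups could look like the sketch below (our hypothetical names and signatures; the actual RTLib API isn't reproduced in the article):

/* Application configuration: declare threads and FIFOs. */
int  rtlib_create_thread(void (*entry)(void *), void *arg, int proc_id);
int  rtlib_create_fifo(unsigned slot_bytes, unsigned depth);

/* Application starting: launch the configured threads and wait. */
void rtlib_start(void);
void rtlib_join_all(void);

/* Communication: blocking data transfer between threads. */
void rtlib_fifo_put(int fifo, const void *data);
void rtlib_fifo_get(int fifo, void *data);

/* Example configuration: a producer and a consumer joined by one FIFO. */
void configure(void (*prod)(void *), void (*cons)(void *))
{
    static int f;
    f = rtlib_create_fifo(sizeof(int), 8);   /* 8-slot FIFO of ints */
    rtlib_create_thread(prod, &f, 0);        /* map producer to processor 0 */
    rtlib_create_thread(cons, &f, 1);        /* map consumer to processor 1 */
    rtlib_start();
    rtlib_join_all();
}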

We have demonstrated the toolflow developed at IMEC for the MPEG-4 encoder on different platforms. In the future, we want to integrate the different tools in the flow even more, because they are mainly used separately today. Furthermore, we are in the process of validating the flow on more applications, particularly in the wireless-communication domain.


[Figure 3. (a) The function cycle distribution on the ARM11: the horizontal axis shows the top-level functionalities—motion estimation (ME), motion compensation (MC), texture coding (TC), texture update (TU), entropy coding (EC), and variable-length coding (VLC)—and the vertical axis shows each functionality's total contribution to the overall cycle count (in Gcycles). (b) A schematic overview of the ARM exploration, displaying seven different parallelizations (A to G); dashed lines indicate functionalities, solid lines indicate threads, and the leftmost column (A) is the sequential implementation. (c) The Pareto curve of the parallelizations corresponding to (b), plotting the inverse of the speedup—that is, the parallel version's execution time divided by the sequential version's execution time—against the number of threads.]


Acknowledgments

The list of people who have contributed to this work is too long to be cited here. However, we want to acknowledge the MPSoC team at IMEC. The research described in this article was performed in the context of IMEC's nomadic-embedded-systems program, which is partly sponsored by Samsung. Elements of this research are part of the Mosart (Mapping Optimization for Scalable Multicore Architecture) project, funded in the frame of European ICT FP7 Computing Systems, and the MNEMEE (Memory Management Technology for Adaptive and Efficient Design of Embedded Systems) project, funded in the frame of European ICT FP7 Embedded Systems Design.

About the Authors

Jean-Yves Mignolet is an activity leader at IMEC Belgium. His work concentrates on design-time tools for multiprocessor systems-on-chip. His research interests include digital design, reconfigurable hardware, and processor and multiprocessor architecture and modeling. Mignolet has an MS in electrical engineering from the Université Catholique de Louvain. Contact him at [email protected].

Roel Wuyts is a senior research engineer at IMEC Belgium and a part-time professor in the Katholieke Universiteit Leuven Computer Science Department. His research interests include embedded software engineering, specifically componentized runtime resource managers and distributed resource managers for networked embedded devices. Wuyts has a PhD in computer science from the Vrije Universiteit Brussel. Contact him at [email protected].

References

1. J.A. Kahle et al., "Introduction to the Cell Multiprocessor," IBM J. Research and Development, vol. 49, July/Sept. 2005, pp. 589–604.
2. K. Hirata and J. Goodacre, "ARM MPCore: The Streamlined and Scalable ARM11 Processor Core," Proc. 2007 Asia and South Pacific Design Automation Conf. (ASP-DAC 07), IEEE Press, 2007, pp. 747–748.
3. S. Agarwala et al., "A 65nm C64x+ Multi-Core DSP Platform for Communications Infrastructure," Proc. 2007 IEEE Int'l Solid-State Circuits Conf. (ISSCC 07), IEEE Press, 2007, pp. 262–601.
4. B. Chapman et al., Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, 2007.
5. "MPI-2: Extensions to the Message-Passing Interface," MPI Forum, Nov. 2003; www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html.
6. ISO/IEC 9945-1:1996, Portable Operating System Interface (POSIX)—Part 1: System Application Program Interface, Int'l Standards Org., Nov. 1996; www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=24426.
7. M. Van Bavel and M. Tilman, "Interactive C-Code Cleaning Tools Support Multiprocessor SoC Design," Embedded Systems Design Europe, Aug./Sept. 2008, pp. 12–19.
8. S. Udayakumaran et al., "Dynamic Allocation for Scratch-Pad Memory Using Compile-Time Decisions," Trans. Embedded Computing Systems, vol. 5, no. 2, 2006, pp. 472–511.
9. R. Baert et al., "An Automatic Scratch Pad Memory Management Tool and MPEG-4 Encoder Case Study," Proc. 45th ACM/IEEE Design Automation Conf. (DAC 08), ACM Press, 2008, pp. 201–204.
10. B. Mei et al., "Architecture Exploration for a Reconfigurable Architecture Template," IEEE Design & Test of Computers, vol. 22, no. 2, 2005, pp. 90–101.
11. R. Baert et al., "Exploring Parallelizations of Application for MPSoC Platforms Using MPA," to be published in Proc. Design, Automation, and Test in Europe (DATE 09), ACM Press, 2009.
