Programming the Memory Hierarchy

Kayvon Fatahalian   Timothy J. Knight   Mike Houston   Ji Young Park
Mattan Erez   Alex Aiken   Pat Hanrahan   William J. Dally

Stanford University

Abstract

We present Sequoia, a programming language designed to enable programmers to develop portable, high-performance applications. Sequoia's design is centered around the observation that high-performance programs are bandwidth efficient, and careful choreographing of data movement throughout the machine is key to application performance. By abstractly exposing hierarchical memory in the programming model, and by providing mechanisms to localize computation to particular memory locations in the machine, Sequoia encourages the development of programs that are bandwidth efficient yet remain portable across machines with different memory hierarchy designs.

1. Introduction

Writing a high-performance application, whether for a uniprocessor or a large-scale parallel machine, requires the programmer to have non-trivial knowledge of details about the underlying system. On modern systems of all scales, a major aspect of performance optimization involves ensuring that processors do not idle frequently while waiting on memory. This involves structuring algorithms and placing data to improve program locality, so that data references are serviced by levels of the memory hierarchy that are as close to the processors as possible, whether that be on-die memory, local DRAM, or remote memory accessed over a high-speed interconnect. Writing programs that make efficient use of a machine's memory system is further complicated by the desire for program portability. A series of optimizations targeted at the hierarchy of one machine is likely not the correct solution for the particular sizes and physical configurations of other machines.

The need for new programming abstractions for managing memory will soon become acute. Multicore microprocessors will be commonplace in the near future, and as more processing units are made to execute simultaneously on a chip, the importance of efficiently managing available memory bandwidth grows. Stream architectures, such as the CELL Broadband Engine [28] or Stanford's Imagine [23] and Merrimac [14] processors, are emerging as efficient platforms for executing computationally intensive workloads such as media processing and scientific computing. These designs feature many parallel ALUs, allowing them to provide an order of magnitude greater peak performance than current CPUs. The processing capability on such chips is so great that the ALUs cannot be adequately sustained using typical cache designs, and thus explicit management of on-chip levels of the memory hierarchy is essential. Unlike traditional processors, which manage the transfer of data between memory and levels of on-chip storage transparently in hardware, these new exposed communication architectures require software to move data between distinct on- and off-chip address spaces. On such platforms, the cost of increased arithmetic performance is software complexity. Explicit memory management is necessary for program correctness, not just performance. The challenges of managing data movement, formerly only a concern when programming large parallel machines, now exist at the workstation level. These difficulties compound as larger scale systems are considered.

The principal idea of our work is that the movement and placement of data at all levels of the machine memory hierarchy should be under explicit programmer control. The primary focus of a programming model targeted at performance-conscious programmers should be to provide abstractions that assist in structuring programs to make efficient use of a target machine's memory hierarchy. This paper presents Sequoia, a programming model that attempts to reflect the realities of performance-centric programming on modern hardware and assist the programmer in structuring programs that are efficient in their use of system bandwidth, yet are easily ported to new machines. The design of Sequoia centers around the following key ideas:

• We introduce the notion of hierarchical memory directly into our programming model. Sequoia programs run on machines that are abstracted as trees of distinct memory modules and describe how data is moved and where it resides in a machine's memory hierarchy.

• We use tasks as an abstraction of self-contained computation units that include descriptions of key information such as communication and working sets. Tasks serve to isolate each computation in its own local address space and can also express parallelism.

• To enable portability, we maintain a strict separation between generic algorithmic expression and machine-specific optimization. To minimize the impact of this separation on performance, the machine-specific mapping of an algorithm remains under programmer control. We do not need to rely on compiler intelligence to generate a good mapping of high-level code to a target machine.

Details of the Sequoia programming model are discussed in Sections 2 through 4. Section 5 describes our implementation of Sequoia language runtimes developed for a single PC and a cluster of workstations. Section 6 discusses results from initial experiments of running simple numerical algorithms using our system.


Figure 1. Example machine trees: A dual-CPU workstation (at left) is modeled as a tree containing nodes corresponding to main system memory and the L1 and L2 caches of the individual processors. The tree representation of a CELL processor-based workstation (at right) contains nodes for each of the microprocessor’s on-chip software-managed Local Stores (LS).

2. Modeling Hierarchical Memory

Sequoia requires the programmer to reason about a parallel machine as a tree of distinct memory modules, a representation similar to the Parallel Memory Hierarchy (PMH) model of Alpern et al. [4]. Data transfer between memory modules is conducted via (potentially asynchronous) block transfers. Program logic manages the transfers of data at all levels, but arithmetic processing is limited to data located within leaf nodes of the machine tree. A machine tree modeling a dual-CPU workstation is shown in the left half of Figure 1. The model contains nodes corresponding to system memory shared between the two CPUs as well as the L1 and L2 caches of each processor. The right half of Figure 1 shows a tree model of a system containing a CELL processor.

Establishing an abstract notion of hierarchical memory is critical to the Sequoia programming model. Sequoia programs never make references to a particular level of a machine's memory hierarchy and remain oblivious to the mechanisms used to move data between memory modules. For example, communication may be implemented using a cache prefetch instruction, a DMA transfer, or an MPI message depending on the requirements of the target architecture. Supplying the programmer constructs to describe the movement of data through the machine, but not the specific mechanism with which to perform the transfer, is essential to ensuring the portability of Sequoia programs while retaining the performance benefits of explicit communication.

Sequoia allows freedom in how a target machine is represented as a tree of memories. The chosen representation may include modules corresponding to all physical levels of the machine memory hierarchy, or it may be a simplification that omits levels of the physical hierarchy that need not be considered for software correctness or performance optimization (for example, the workstation's L1 cache may be omitted from the model of the cluster machine shown on the right-hand side of Figure 2). For program correctness, all software-managed levels of the memory hierarchy must exist in the machine representation.

The decision to represent machines as trees was motivated by the desire to retain portability while minimizing programming complexity. A program that performs direct communication between sibling memories cannot run without modification on a platform where such channels do not exist. To accommodate non-tree topologies, the abstract representation of a machine hierarchy may include virtual levels that do not correspond to any single physical memory module. For example, it is not practical to expect the nodes in a cluster of workstations to communicate only via global storage provided by networked disk. As shown in Figure 2, we represent a cluster as a tree rooted by a module corresponding to the aggregation of all workstation memories. Transferring data from this root module into child modules associated with each of the individual cluster workstations results in communication over the cluster interconnect. The inclusion of virtual levels in machine representations enables a broad class of architectures to be represented abstractly as trees and allows non-tree communication links to be modeled as additional tree nodes. This permits applications to benefit from direct communication links provided by the hardware, even when programs are not written to explicitly manage these links. The Sequoia programming model is designed to encourage the development of programs that remain efficient despite these simplifying machine abstractions.

Figure 2. The point-to-point links connecting PCs in a cluster are modeled as a virtual node in the tree representation of the machine.
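
For concreteness, a machine hierarchy of this kind can be thought of as a small tree data structure. The C++ sketch below is purely illustrative (the structure, field names, and capacities are our own assumptions, not Sequoia's machine-description format); it builds a tree like the cluster representation of Figure 2, with a virtual root aggregating the per-node memories:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct MemoryModule {
        std::string name;                    // e.g. "node_memory", "L2_cache"
        std::size_t capacity_bytes;          // 0 if unbounded/unknown
        bool is_virtual;                     // true for levels with no single physical memory
        std::vector<MemoryModule> children;  // child memory modules
    };

    // Tree for the cluster of Figure 2: a virtual root aggregating all node memories.
    MemoryModule cluster_machine(int num_nodes) {
        MemoryModule node{"node_memory", std::size_t(1) << 30, false,
            { MemoryModule{"L2_cache", 512 * 1024, false,
                { MemoryModule{"L1_cache", 16 * 1024, false, {}} } } } };
        MemoryModule root{"aggregate_cluster_memory", 0, true, {}};
        for (int i = 0; i < num_nodes; ++i) root.children.push_back(node);
        return root;
    }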

3. Sequoia Design

The principal construct of the Sequoia programming model is a task: a side-effect free function with call-by-value-result parameter passing semantics. Tasks provide for the expression of:

• Explicit Communication and Locality. Communication of data through the memory hierarchy is expressed by passing arguments to tasks. Calling tasks is the only means of describing data movement in Sequoia.

• Isolation. Tasks operate entirely within their own private address space and are not permitted to communicate with any other tasks except by calling subtasks. This enables tasks to be run concurrently when no dependencies exist.

• Algorithmic Variants. Sequoia allows the programmer to provide multiple implementations of a task and to specify the implementation to be used based on the context in which the task is called.

• Parameterization. Tasks are expressed in parameterized form to preserve independence from the constraints of any specific machine. Parameter values are chosen to tailor task execution to a hierarchy level in the machine.

This collection of properties allows programs written using tasks to be portable across machines without sacrificing the ability to tune for performance.


     1  void __task matmul::inner( __in    float A[M][P],
     2                             __in    float B[P][N],
     3                             __inout float C[M][N] )
     4  {
     5     // tunable parameters specify the size
     6     // of blocks of A, B, and C.
     7     tunable unsigned int U;
     8     tunable unsigned int X;
     9     tunable unsigned int V;
    10
    11     // partition matrices into sets of blocks,
    12     // using a regular 2D chopping.
    13     blkset Ablks = rchop(A, U, X);
    14     blkset Bblks = rchop(B, X, V);
    15     blkset Cblks = rchop(C, U, V);
    16
    17     // Compute all blocks of C in parallel.
    18     mappar(int i=0 to M/U, int j=0 to N/V) {
    19        mapseq(int k=0 to P/X) {
    20           // Invoke the matmul task recursively on
    21           // the sub-blocks of A, B, and C.
    22           matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
    23        }
    24     }
    25  }
    26
    27  void __task matmul::leaf( __in    float A[M][P],
    28                            __in    float B[P][N],
    29                            __inout float C[M][N] )
    30  {
    31     // compute matrix product directly
    32     for (int i=0; i<M; i++)
    33        for (int j=0; j<N; j++)
    34           for (int k=0; k<P; k++)
    35              C[i][j] += A[i][k] * B[k][j];
    36  }

Figure 3. Matrix multiplication in Sequoia. matmul::inner and matmul::leaf are variants of the matmul task.

3.1 Explicit Communication and Locality

Arguments to a task may be scalar types (int, float, a struct, etc.) or multi-dimensional arrays of scalars. Each task argument must be designated as an input (__in), an output (__out), or read-modify-write (__inout) argument. Figure 3 shows the multiplication of matrices using Sequoia. In this example, the working set of the task matmul::leaf consists of the MxP matrix A, the PxN matrix B, and the MxN matrix C. The matrices A and B are read-only arguments to the task while the matrix C is a read-modify-write argument.

As stated above, Sequoia tasks use call-by-value-result (CBVR) parameter passing semantics. CBVR has been known for decades [1] but is not common in modern languages. We observe that for execution on machines where data is transferred between distinct physical memories under software control, CBVR is the natural parameter passing semantics. When calling a task, data from the calling address space is copied into the callee's address space. Upon return, modified argument data is copied back to the correct location in the calling task's address space.

Tasks run at a specific location in the machine: while a task executes, its entire working set (the collection of all data the task can reference) must be resident in a single node of the abstract machine tree. Pointers and references are not permitted within a task and therefore a task's working set is manifest in its definition. As a result, defining tasks expresses locality in a program. When a task calls a subtask, the subtask may execute within the same memory module as its calling task or may be assigned to a child (often smaller) memory module that is closer to a compute processor. In the latter case, the subtask's working set must be transferred between the two memory modules upon task call/return. Thus, the call/return of a subtask implies that data movement through the machine hierarchy might occur. Limiting communication to CBVR parameter passing allows for efficient implementation via hardware block-transfer mechanisms and permits early initiation of transfers when the arguments are known in advance.
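
The copy-in/copy-out behavior of CBVR at a task boundary can be sketched in plain C++ as follows. The helper names and types are invented for illustration; Sequoia's actual runtime interface is discussed in Section 5:

    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct ArgBlock {
        float*      caller_data;  // location of the block in the calling task's address space
        std::size_t count;        // number of elements in the block
        bool        is_output;    // __out / __inout arguments are copied back on return
    };

    // Run 'task_body' on private copies of its arguments (copy-in), then write
    // output arguments back to the caller (copy-out).
    template <typename TaskBody>
    void call_task_cbvr(std::vector<ArgBlock>& args, TaskBody task_body) {
        std::vector<std::vector<float>> private_copies;
        for (const ArgBlock& a : args)                        // copy-in
            private_copies.emplace_back(a.caller_data, a.caller_data + a.count);

        task_body(private_copies);                            // task sees only its private copies

        for (std::size_t i = 0; i < args.size(); ++i)         // copy-out
            if (args[i].is_output)
                std::memcpy(args[i].caller_data, private_copies[i].data(),
                            args[i].count * sizeof(float));
    }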

3.2 Isolation

A key property of all tasks is isolation. A task operates entirely within a private address space containing only its working set. As a result, calling a task introduces a change of address space. The isolation of task address spaces implies that no constraints exist on whether a subtask must execute within the same level of the memory hierarchy as its calling task. A task has no means of communicating with other tasks executing concurrently on a machine (specifying parallel tasks is discussed in Section 3.4). As a result, although Sequoia applications consist of many parallel tasks, these concurrent tasks do not function as cooperating threads. The lack of shared state among tasks allows parallel tasks to be executed simultaneously on parallel execution units or sequentially on a single processor. Thus, the granularity of parallelism in Sequoia is the task. Task isolation also simplifies parallel programming by obviating the need for synchronization or locking constructs that are required when programming with cooperating threads.

3.3 Task Variants

Two implementations of the matmul task, matmul::inner and matmul::leaf, are given in Figure 3. Each implementation is referred to as a variant of the task and is named using the syntax taskname::variantname. Line 22 of Figure 3 contains a recursive call to the matmul task, but the Sequoia syntax does not specify whether the matmul::inner or matmul::leaf variant is to be invoked. It is not possible to call a particular task variant from Sequoia code since this decision is made as part of the machine-specific mapping of a Sequoia algorithm (Section 4).

Sequoia algorithms are expressed as hierarchies of tasks. Computation on arrays occurs within tasks at the leaves of this hierarchy. These tasks, called leaf tasks, do not call subtasks and operate on working sets stored within leaf levels of the memory hierarchy. Leaf tasks are similar to kernel functions featured in stream processing languages [26, 8]. For example, mapping a leaf task onto the elements of an array is similar to mapping a kernel onto a data stream. In Figure 3, matmul::leaf is a leaf task.

Non-leaf tasks, referred to as inner tasks, are tasks that call subtasks. Inner tasks, such as matmul::inner, decompose computation into subtasks that operate on smaller working sets. Figure 4 illustrates one iteration of matmul::inner's partitioning of a large matrix multiplication into smaller submatrix multiplications. The large matrices appearing in the upper box constitute the calling task's working set. The recursive call to matmul on line 22 passes submatrices (denoted in dark gray) from the calling task's address space as arguments to the subtask. Due to the change of address space induced by calling a task, the arguments are visible as smaller matrices to the subtask. Unlike leaf tasks, the working sets of inner tasks are resident at memory hierarchy levels that may be distant from the processor responsible for the inner task's execution. Since in the Sequoia model communication is only allowed to occur at task call boundaries, inner tasks are restricted from directly accessing elements of aggregate data structures. Observe that while matmul::leaf computes on the elements of its arguments directly, matmul::inner does not reference elements of the matrices A, B, and C.

Although task hierarchies force applications to be structured in a manner suitable for execution on software-managed hierarchical memory, they remain machine-independent constructs. An inner task is not associated with any particular machine memory module; it may execute at any level of the memory hierarchy where its working set fits.



Figure 4. The matmul::inner variant calls subtasks to perform submatrix multiplications. Blocks of the matrices A, B, and C are passed to subtasks as arguments. These blocks appear as smaller matrices in the address space of the subtask.

Figure 5. The static call graph for the matmul task is shown at left. Specialization to the cluster machine from Figure 2 creates multiple instances of this task. The resulting call graph among task instances is shown at right.

3.4 Task Decomposition

The utility of the Sequoia programming model lies in the use of inner tasks to partition computation into smaller subproblems. Before proceeding to discuss task parameterization (Section 3.5), we introduce Sequoia's array blocking and task mapping functions, primitives designed to assist the programmer in describing task partitioning in a portable and high-performance manner.

A subset of the elements in an array is referred to as an array block. For example, A[0:10] is the block corresponding to the first 10 elements of the array A. Blocks, like arrays, are multidimensional. The Sequoia array blocking functions produce an ordered collection of (potentially overlapping) blocks from a single array. This collection is encapsulated in an opaque object called a blkset. Sequoia has three built-in blocking functions:

• The rchop (regular chop) function produces a collection of blocks based upon arguments that describe the size of and stride between blocks. The set of blocks, Ablks, defined in line 13 of Figure 3 represents the collection of blocks that results from using the rchop function to divide the matrix A into 2D blocks of size U by X.

• Irregular blocking is supported by the ichop function, which creates a collection of blocks whose endpoints are given by a specified index array.

• The gather function creates blocks containing arbitrary array elements using indices provided by an index array.

Array blocks are passed as arguments to subtasks. Since each task call induces a change of address space, a block containing N array elements is made visible in the callee task's address space as an array argument of size N. CBVR parameter passing permits multiple locations in a subtask's address space to correspond to the same location in the address space of the calling task. In such situations, upon subtask completion, different subtask outputs are copied back to the same location in the caller's address space. To avoid this situation, blocks passed as output arguments to a task are not permitted to alias.

It is common to want to apply the same operation to each block in a collection. For example, matmul::inner calls matmul repeatedly on blocks of its input matrices. Mapping a subtask onto many blocks in succession is performed via the mappar (parallel iteration) and mapseq (sequential iteration) constructs. These constructs constitute highly constrained loops whose body may only contain a single task call. matmul::inner contains a three-dimensional iteration space (lines 18-25). Iterations from the two outermost dimensions of the mapping may be executed in parallel. The innermost dimension imposes a sequential ordering on subtasks. In each iteration, a block from each of the blksets is selected to be passed as an argument to the subtask (indexing a blkset selects a block from the collection).

A mappar implies concurrency among subtasks but not asynchronous execution between calling and child tasks. All iterations of a parallel mapping sequence must complete before control flow returns to the calling task. Sequoia does not allow a subtask called as part of a parallel mapping sequence to write to a block that aliases any block used as an argument by another task in the sequence. This restriction is required since parallel tasks may be executed in any order.

Imperative C-style control flow is allowed in inner tasks, but use of the blocking and mapping primitives is encouraged. These primitives allow for a number of key optimizations to be performed by a dynamic language runtime (described in Section 5) without the aid of static compiler support. Additionally, in the future, their use should facilitate more aggressive offline compiler scheduling of Sequoia applications.
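
As an illustration of the regular chopping and mapping just described, the following C++ sketch (the names are ours, not Sequoia's) enumerates the grid of U x V blocks of an M x N array, mirroring what rchop plus a mappar over the two outer dimensions express in Figure 3:

    #include <cstddef>

    // One 2D block of an array: origin and extent, like the blocks produced by rchop.
    struct Block2D { std::size_t row0, col0, rows, cols; };

    // Enumerate the (M/U) x (N/V) grid of U x V blocks of an M x N array and invoke
    // fn(i, j, block) for each. Assumes U divides M and V divides N, as in matmul.
    // The i and j loops are independent, so a mappar over them may run in parallel.
    template <typename Fn>
    void rchop2d(std::size_t M, std::size_t N, std::size_t U, std::size_t V, Fn fn) {
        for (std::size_t i = 0; i < M / U; ++i)
            for (std::size_t j = 0; j < N / V; ++j)
                fn(i, j, Block2D{i * U, j * V, U, V});
    }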


3.5 Task Parameterization

Tasks are written in parameterized form to allow for specialization to the configuration of a target machine. Specialization is the process of creating an instance of a task that is customized to operate within a specific level of a target machine's memory hierarchy. A task instance defines the code to execute (a task variant) and an assignment of values to all variant parameters. Task instances are created for each of the various contexts in which a task is used. For example, when running the task matmul on a cluster machine, one matmul instance is needed to partition matrices, distributed across the cluster, into submatrices that fit within individual nodes. Additional instances are employed to decompose these datasets further into L2- and then L1-sized submatrices. The matmul::inner variant is used to conduct this partitioning in all cases (an instance of matmul::leaf multiplies L1-sized matrices); however, the sizes of the subblocks created by each instance are different. In this example, the size of the subblocks is set by a parameter of the matmul::inner variant.

Note that while parameterized tasks do not name specific variants when calling subtasks, the code of a specialized task instance does make direct calls to other instances. The static call graph for matmul's parameterized task variants is shown on the left side of Figure 5. The static call graph of the four task instances that result from specialization is shown on the right of the figure. Notice that three instances, each mapped to a different location of the machine hierarchy, are created from the matmul::inner variant.

Task variants utilize two types of numeric parameters: array size parameters and tunable parameters. Array size parameters, such as M, N, and P defined in the matmul task variants, represent sizes of array arguments passed to the task at runtime. Array size parameters may take on different values in each call to the task. Tunable parameters, such as the integers U, V, and X declared in matmul::inner (lines 7-9 of Figure 3), are declared using the tunable keyword. Tunable parameters remain unbound in Sequoia source code, but are statically assigned values during task specialization. Once assigned, tunable parameters are treated as compile-time constants. The most common use of tunable parameters, as illustrated by the matrix multiplication example, is to specify the size of array blocks passed as arguments to subtasks.

Parameterization allows the decomposition strategy described by a task variant to be used in a variety of contexts, making the task portable across machines or across levels of the memory hierarchy within a single machine. The use of tunable and array size parameters, and the support of multiple task variants, is key to decoupling the expression of an algorithm from its mapping to the underlying machine. Tasks provide a framework for defining the space of decisions that must be made during the process of application tuning. In the following section, we describe the process of tuning and targeting Sequoia applications to a machine.

In addition to its impact on code portability, task parameterization facilitates the expression of generic locality-aware algorithms in forms that are similar to code containing explicit assumptions about the sizes or number of levels of a machine memory hierarchy. Unlike the cache-oblivious approach to developing generic algorithms [18], Sequoia algorithms can be expressed so that decomposition occurs only when crossing memory hierarchy boundaries. There is no overhead of excessive recursion and no need for a compiler to determine when to terminate recursive calls.
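
A loose C++ analogy may help: binding tunables during specialization resembles instantiating a template whose block sizes are non-type parameters, producing one instance per memory-hierarchy level. The sketch below is only an analogy, using the tunable values that appear in Figure 6:

    #include <cstddef>

    // C++ analogy for task specialization (illustrative only): binding the
    // tunables U, V, X as compile-time constants yields one instance per level.
    template <std::size_t U, std::size_t V, std::size_t X>
    struct MatmulInstance {
        static constexpr std::size_t block_elems = U * X + X * V + U * V;  // elements touched per call
    };

    using ClusterInst = MatmulInstance<1024, 1024, 1024>;  // cluster-level instance
    using NodeInst    = MatmulInstance<128, 128, 128>;     // node-level instance
    using L2Inst      = MatmulInstance<32, 32, 32>;        // L2-level instance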

4. Task Mapping

Tasks are generic algorithms that must be specialized before they can be executed on a machine. Mapping a hierarchy of tasks onto a hierarchical representation of memory requires the creation of task instances for all machine levels. For each instance, a code variant to run must be selected, target instances to invoke at call sites must be chosen, and values for tunable parameters must be provided.

A traditional approach to specialization would be to use a compiler to generate task instances for a target machine. We choose not to depend entirely on compiler analysis and instead provide the programmer complete control of the mapping and tuning phases of program development. A unique aspect of Sequoia is the task mapping specification that is created by the programmer on a per-machine basis and is maintained separately from Sequoia source. Figure 6 is an example of the information provided to map matmul onto the cluster machine shown in Figure 2. The tunables have been chosen such that submatrices fit into each level of the memory hierarchy (the workstations comprising our cluster feature a 512KB L2 and a 16KB L1 cache).

In addition to defining the mapping of a task hierarchy to a machine memory hierarchy, the mapping specification also serves as the location where the programmer specifies optimization and tuning directives that are particular to the capabilities of the intended target. A performance-tuned version of matmul is given in Figure 7. The matmul_cluster_instance runs at the cluster level of the machine hierarchy, so the distribution of array arguments across the machine has significant performance implications. The instance specifies that its argument matrices be distributed using a 2D block-block decomposition (Figure 8 illustrates how block-block decomposition assigns 2D regions of the matrices to cluster nodes). In addition to specifying which instance to call at matmul::inner's call to matmul, matmul_cluster_instance specifies that the transfer of subtask arguments to the individual nodes should be double-buffered across mappar iterations to hide the latency of the transfers. As an additional optimization, the matmul_L2_instance specifies that the system should copy the second and third arguments passed to matmul::leaf into contiguous buffers for better access coherence in the leaf task.

Mapping specifications are intended to give the programmer precise control over the mapping of a task hierarchy to a machine while isolating machine-specific optimizations to a single location. Performance on a machine is improved as details in the mapping specification are refined. We are not confident in the ability of a compiler, given parameterized task definitions alone, to automatically produce a quality mapping of the code to a machine. While an intelligent compiler may be capable of automating the creation of parts of a new mapping specification, Sequoia's design relies on the performance-oriented programmer to determine the key aspects of this mapping to achieve maximum performance.

The initial port of a Sequoia application to a new machine requires the creation of a new mapping specification. We believe that the programmer effort required to create a mapping is justified. Although the well-tuned mapping specification for matrix multiplication exceeds the length of the Sequoia code to express the algorithm itself, the amount of overall work required of the programmer is far less than if he were forced to design implementations of matrix multiplication for multiple platforms, such as a single PC and a cluster of PCs, independently. We believe our approach to the separation of algorithmic expression from machine-specific performance tuning is a better way of developing portable algorithms without sacrificing performance.

    instance {
       name     = matmul_cluster_instance
       task     = matmul
       variant  = inner
       runs_at  = cluster_memory
       tunable U = 1024, V = 1024, X = 1024
       calls    = matmul_node_instance
    }
    instance {
       name     = matmul_node_instance
       task     = matmul
       variant  = inner
       runs_at  = node_memory
       tunable U = 128, V = 128, X = 128
       calls    = matmul_L2_instance
    }
    instance {
       name     = matmul_L2_instance
       task     = matmul
       variant  = inner
       runs_at  = L2_cache
       tunable U = 32, V = 32, X = 32
       calls    = matmul_L1_instance
    }
    instance {
       name     = matmul_L1_instance
       task     = matmul
       variant  = leaf
       runs_at  = L1_cache
    }

Figure 6. Specification for mapping the matmul task to the cluster machine from Figure 2.
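
As a quick check on these choices (assuming 4-byte floats): at the L2 level the three 128x128 blocks of A, B, and C together occupy 3 x 128 x 128 x 4 bytes = 192KB, which fits in the 512KB L2 cache, and at the L1 level the three 32x32 blocks occupy 3 x 32 x 32 x 4 bytes = 12KB, which fits in the 16KB L1 cache.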

    instance {
       name     = matmul_cluster_instance
       task     = matmul
       variant  = inner
       runs_at  = cluster_memory
       A distribution = 2D block-block (blocksize 1024x1024)
       B distribution = 2D block-block (blocksize 1024x1024)
       C distribution = 2D block-block (blocksize 1024x1024)
       tunable U = 1024, V = 1024, X = 1024
       callsite 0:
          calls     = matmul_node_instance
          partition = 4x4 grid
          arg 0 = double-buffer
          arg 1 = double-buffer
          arg 2 = double-buffer
    }
    instance {
       name     = matmul_node_instance
       task     = matmul
       variant  = inner
       runs_at  = node_memory
       tunable U = 128, V = 128, X = 128
       callsite 0:
          calls = matmul_L2_instance
          arg 0 = no-copy
          arg 1 = no-copy
          arg 2 = no-copy
    }
    instance {
       name     = matmul_L2_instance
       task     = matmul
       variant  = inner
       runs_at  = L2_cache
       tunable U = 32, V = 32, X = 32
       callsite 0:
          calls = matmul_L1_instance
          arg 0 = no-copy
          arg 1 = copy
          arg 2 = copy
    }
    instance {
       name     = matmul_L1_instance
       task     = matmul
       variant  = leaf
       runs_at  = L1_cache
    }

Figure 7. A tuned version of the mapping specification from Figure 6. The cluster instance now declares a distribution of its working set among cluster nodes and requires double-buffering to be used when communicating submatrices between nodes. Within a node, submatrices are to be copied into contiguous regions of memory only when calling leaf tasks.

5. Runtime Implementation

To expedite the development of a complete system capable of running Sequoia code on multiple target platforms, we elected to start with a C++ runtime responsible for implementing the abstractions required by tasks. We modified the C++ front-end provided by Elsa [27] to parse Sequoia syntax and emit C++ code containing calls to our runtime API. This process is strictly a syntax transformation; no static analysis of Sequoia code is performed, so the front-end merely alleviates the burden of writing directly to the runtime API. A sophisticated compiler solution was not required to perform the experiments discussed in this paper. Figure 9 illustrates the major components of our system.

Since parallel microprocessors that expose on-chip storage to direct software control, such as the Cell Broadband Engine [28], are not readily available, we targeted the Sequoia runtime for two common computing platforms: a single PC workstation and a cluster of workstations. The runtime implementation is responsible for the efficient execution of task instances on the underlying machines. In particular, this involves scheduling the transfer of data to the machine locations expected by tasks and scheduling the execution of tasks onto the machine's processing units.

At application startup, the runtime loads a description of the host machine hierarchy as well as a task mapping specification specific to the given machine. Currently, the machine and mapping specifications are supplied to the runtime in the form of XML files. When a Sequoia task is called, the runtime consults the mapping to determine which instance to select and the machine location where it should be run. The runtime copies task argument data to this location as necessary before invoking the instance. Since machine-specific details are confined to the mapping specification file, no source-level changes, and thus no recompilation, are necessary to port Sequoia applications among the target configurations supported by the two runtimes. Only a unique mapping specification is required for each target architecture.

Figure 9. The Sequoia runtime system: Our front-end transforms Sequoia task definitions into C++ code that links against the Sequoia runtime. At application startup, the runtime requires machine description and mapping specification files that describe how the application should be run on the given machine.
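
The dispatch behavior just described might be sketched as follows; the type and function names are hypothetical, since the paper does not show the runtime API itself:

    #include <map>
    #include <string>

    // One 'instance' record from the mapping specification.
    struct InstanceInfo {
        std::string variant;                  // "inner" or "leaf"
        std::string runs_at;                  // memory module the instance is mapped to
        std::map<std::string, int> tunables;  // e.g. {"U",128}, {"V",128}, {"X",128}
    };

    struct SequoiaRuntime {
        std::map<std::string, InstanceInfo> mapping;  // parsed from the XML mapping file at startup

        // Invoked for each task call emitted by the front-end.
        void call_task(const std::string& instance_name) {
            const InstanceInfo& inst = mapping.at(instance_name);
            // 1. Ensure the task's working set is resident at inst.runs_at
            //    (a memcpy on the single-node runtime, MPI transfers on the cluster).
            // 2. Execute the variant named by inst.variant with inst.tunables bound.
            // 3. Copy modified output arguments back to the caller on return (CBVR).
            (void)inst;  // data movement and execution elided in this sketch
        }
    };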


Figure 8. A block-block distribution assigns 2D blocks of matrices to each workstation in the cluster. The id of the workstation responsible for storing each block is shown.


5.1 Single Workstation Runtime

The single-node runtime executes Sequoia applications on a uniprocessor PC. This implementation is used both as a reference and to validate that locality-aware programs structured as task hierarchies realize performance benefits on conventional architectures. The single-node runtime executes all tasks sequentially and does not attempt to prefetch data into CPU caches. The transfer of data between task address spaces is implemented as a memcpy.

Efficiently implementing the Sequoia language's CBVR parameter passing semantics was the primary challenge of this implementation. Since a Sequoia task call chain may grow arbitrarily deep before reaching a leaf task, performing a copy of data into each task's address space cripples performance and is wasteful of memory. In general, data copies are desired when a subtask's working set is sized to fit within a CPU cache closer to the processor.


To avoid extra overhead, the runtime delays the copy of task arguments until direct access to the data is required (by a leaf task) or until the copy is explicitly specified by the programmer. Whether or not a copy of argument data should occur during a task call is configurable on a per-argument basis in the machine mapping. When copies do occur, the runtime dynamically detects situations when the same argument is used in back-to-back task calls and is able to remove argument copy-out and copy-in operations in these situations. To minimize runtime overhead, our system does not attempt to detect argument reuse over larger windows. We expect that in many situations static data-flow analysis in a compiler could yield a more efficient schedule of data movement than what is obtainable by our runtime implementation.
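
A minimal sketch of this lazy-copy policy, with invented names (the actual runtime code is not shown in the paper):

    #include <cstddef>
    #include <cstring>

    // An argument as tracked by the single-node runtime.
    struct TaskArg {
        float*      parent_data;   // data in the calling task's address space
        float*      local_copy;    // private copy; null until materialized
        std::size_t count;
    };

    // Inner tasks never touch array elements, so passing an argument down the call
    // chain only forwards the parent pointer. A copy is made ("materialized") only
    // when a leaf task needs direct access or the mapping requests an explicit copy.
    float* materialize(TaskArg& a) {
        if (a.local_copy == nullptr) {
            a.local_copy = new float[a.count];
            std::memcpy(a.local_copy, a.parent_data, a.count * sizeof(float));
        }
        return a.local_copy;  // reused (copy-out/copy-in elided) if the same argument
    }                         // appears in back-to-back task calls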

5.2 Cluster Runtime

Although a cluster of workstations has no memory that is directly addressable by all processors, we model a cluster of N workstations as a tree of memories rooted by a globally visible virtual address space. Therefore, task instances assigned to this level of the virtual hierarchy operate on arrays physically distributed across the cluster. The memory hierarchies of the individual workstations constitute subtrees of the root cluster level, and the cluster runtime manages the execution of task instances mapped to this level of the machine. Communication between workstations is implemented by the runtime using nonblocking MPI messages. The cluster runtime leverages the single-node runtime to execute task instances mapped to levels contained within a single workstation.

Implementing the Sequoia runtime on a cluster of workstations presented two key challenges. First, we were required to implement a unified address space abstraction on top of physically distributed memories. Second, the implementation must map multiple levels of parallel tasks onto a flat collection of machines.

Sequoia's design made the implementation of a single address space abstraction feasible without significant performance penalties. Since inner tasks do not access array argument data elements directly, no fine-grained accesses to remote data occur. Only when program control moves from a task at the cluster level to a task instance mapped to a single cluster node is the runtime required to transfer array data to that node. Sequoia's CBVR parameter passing semantics minimize network communication in the system.



Figure 10. Application speedup due to execution on up to 16 processors on our cluster (communication via Infiniband interconnect).

The system transfers all task argument data to a node in batches. Only when the task completes are the output arguments transferred back to the appropriate location in the cluster. As a result, once a task begins, all data it accesses is already resident on the local node. To hide the latency of the block transfer of arguments, the runtime prefetches remote arguments for future tasks while the current task is executing. The amount of prefetching to perform, as well as the distribution of arrays in the cluster-wide address space, are configurable as part of the mapping specification.

At the root of a Sequoia task hierarchy there is little parallelism. Application parallelism grows as tasks decompose into parallel collections of subtasks, but several levels of decomposition might be necessary to reach the amount of parallelism required to saturate the number of processing nodes in the machine. Tasks are assigned to physical nodes using the following scheme. A pool of cluster nodes is responsible for computing every task at the cluster level. For example, all nodes in the cluster are responsible for computing the root task in a hierarchy. When a task decomposes into parallel subtasks (a mappar), the pool of nodes is partitioned into groups that are assigned responsibility for performing some of the subtasks. Since there is no parallelism within a task until a mappar is encountered, only one node from a pool executes a task. When this node reaches a mappar, it notifies other nodes in the pool, instructing them to begin execution on a subset of the parallel subtasks. Parallel execution proceeds from this point until the mappar is completed. Like all other machine-specific details, the partitioning of parallel iteration spaces among nodes in a pool, and the details of the breakdown into subpools, are governed by the mapping specification. Although nothing in the design of Sequoia prevents us from doing so, our implementation does not perform dynamic reassignment of tasks to processing nodes to obtain better load balancing.
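
The pool-splitting scheme described above might look roughly like the following; this is an illustrative sketch, not the cluster runtime's actual code:

    #include <cstddef>
    #include <vector>

    using NodePool = std::vector<int>;   // MPI ranks jointly responsible for one task

    // Split a pool into 'groups' sub-pools of roughly equal size, one per group of
    // parallel subtasks created by a mappar.
    std::vector<NodePool> split_pool(const NodePool& pool, std::size_t groups) {
        std::vector<NodePool> sub(groups);
        for (std::size_t i = 0; i < pool.size(); ++i)
            sub[i % groups].push_back(pool[i]);
        return sub;
    }
    // Only the first node of a pool executes the calling task; when it reaches a
    // mappar it notifies the others, each sub-pool runs its share of the subtasks,
    // and control returns to the caller once every iteration has completed.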

6. Evaluation

To conduct an initial evaluation of the performance of Sequoia applications and to assess their adaptability to different architectures, we measured the performance of several algorithms expressed in Sequoia on both runtimes.


Matmul computes the product of dense MxP and PxN matrices using a blocked algorithm. The matmul task discussed throughout this paper constitutes our matrix multiplication implementation.

Iterconv2D performs multiple iterations of processing on a 2D dataset using a filter with 5x5 support. The output of each iteration is used as the input signal in the next. When executed using the cluster runtime, communication between processors is required between iterations.

Sort employs a recursive bitonic sort algorithm to sort a list of numbers in parallel. Recursion is truncated after lg P steps, where P is the number of processors. A sequential N lg N sort operation (array size N) is performed as the leaf task on each of the processors. Combining the results of the parallel operations induces a complex interchange of data between nodes.

Our workstation target has dual 2.4GHz Intel P4 Xeon processors with 1GB of main memory. Our cluster setup consists of 16 workstations of this configuration connected by Gigabit Ethernet and Mellanox PCI-X 4X Cougar Infiniband HCA interconnects. Both runtimes use only one processor per node for computation.

We compared the performance of matmul written in Sequoia with versions of blocked matrix multiplication written in C and with Intel's commercial implementation in the MKL library [21] on a single node. The matmul task is specialized into 3 instances so as to perform blocking into the processor's L2 (128x128 matrices) and L1 cache (32x32 matrices). The Sequoia implementation multiplied 4096x4096 matrices at a rate of 731 MFLOPS compared to 800 MFLOPS for the C implementation. The Sequoia implementation is within 10% of the speed obtained by handwritten C code that used the same blocking strategy, demonstrating the low overhead of our runtime. Improving the implementation of matmul::leaf with hand-tuned code for the Pentium 4 processor increased overall performance to 3.0 GFLOPS, about 45% of the performance obtained by the highly optimized MKL library. More global optimizations, including restructuring the inner tasks, could further improve performance, potentially at the cost of portability or performance on other architectures. However, if a highly tuned library is available for a target architecture, the leaf task should directly make use of it for maximum performance on that machine.

Because architecture-specific tuning was limited to the leaf task, and no architecture-specific tuning was performed on the inner tasks, porting matrix multiplication to run on the cluster required no source-level changes to the algorithm. We added a single task instance definition to the mapping specification to define how to operate upon matrices resident at the virtual cluster level of the hierarchy. This instance describes the distribution of the matrices across the cluster as well as the assignment of the parallel submatrix multiplications to the cluster nodes. Figure 7 lists the information contained in the complete cluster mapping file. The definition of matmul_cluster_instance contains the additions that were added specifically for cluster execution. Using Sequoia's generic mechanisms to subblock the matrices yields the same performance effects as manually blocking an iteration space in C. However, using Sequoia, the exact same blocking strategy can be used to distribute the computation across a cluster.

Figure 10 graphs the speedup of each of the algorithms due to execution on an increasing number of cluster nodes (using the Infiniband interconnect). Dataset sizes were held fixed in all runs and the same Sequoia source is used in both the single-node and the various cluster-specialized versions of each application. Matrix multiplication demonstrates a speedup of 11.1x on 16 nodes, achieving 33.5 GFLOPS, for the chosen matrix dimensions M=4096, N=4096, and P=8192. As stated earlier, obtaining this speedup required only minor additions to the mapping specification for a single-node machine.


Alternatively, once we use Sequoia to provide a strategy for decomposing the computation across the nodes, we can directly use MKL for the node-level computation and achieve 77 GFLOPS on 16 nodes. For matrix multiply, the runtime is able to overlap communication between nodes with computation.

We observed similar results (10.9x speedup on 16 processors) running iterconv2D on a dataset consisting of 8192x8192 elements. As with matmul, specializing the algorithm to a cluster machine with P nodes only required additional information describing the distribution of data and parallel subtasks among the nodes. Bitonic sort scales as P/lg P in the number of nodes P, and our speedup of 2.7x using 16 nodes is broadly consistent with this expectation. An improved runtime implementation would increase sorting performance closer to the expected 16-node speedup of roughly 4 (P/lg P = 16/lg 16 = 16/4 = 4).

Lastly, we forced network communication to use the Gigabit Ethernet interconnect instead of Infiniband and re-tuned the applications for the new cluster configuration. The maximum node-to-node bandwidth available over the Gigabit Ethernet network (85MB/sec) is nearly an order of magnitude lower than that afforded by the Infiniband links (760MB/sec). Experimentally, we determined that the optimal mapping of iterconv2D required different values for the algorithm's tunable parameters depending on the choice of interconnect. For example, for 16 nodes, the optimal tunable values for Infiniband are larger than those for Gigabit Ethernet. Using the larger tunable values with Gigabit Ethernet causes a 15% decrease in performance as compared to using the optimal, smaller, tunable values for this interconnect. The optimal values of the tunable parameters for matrix multiplication and sorting did not change. However, in the case of matrix multiply, increasing the working set size for the node-level task achieves close to peak performance much faster under Infiniband than Gigabit Ethernet. Modifying tunables did not require any modification to the algorithm definition or recompilation, and was done purely through the mapping files.

7. Related Work

Automatic restructuring of programs to improve data locality is possible for affine programs [25]. While this type of analysis aims to tackle similar performance goals as our work, it does not currently provide mechanisms for distributed systems and rich memory hierarchies, nor does it work with the many non-affine programs available.

There have been many attempts to incorporate explicit data locality into parallel programming models. Split-C [12], UPC [10], and Titanium [31] present a single program address space, but seek to minimize communication between processors by designating memory that is local or remote to each program thread. Stream processing languages [26, 8] also build upon a two-tiered memory model [24], choosing to differentiate between on- and off-chip storage. Modern parallel language efforts [11, 9, 2] support locality-cognizant programming through the concept of distributions (from ZPL [15]). A distribution is a map of array data to a set of machine locations, facilitating a single program namespace despite execution on nodes with physically distinct address spaces. Distributions fail to describe movement of array data up and down the memory hierarchy and are not applicable when arrays are not stored across distributed memories. Our decision to use a hierarchical memory model brings explicit control of communication between nodes and within a node into a common framework.

The notion of modeling hierarchical memory is not new. The Uniform Memory Hierarchy Model (UMH) [3] abstracted uniprocessor machines as sequences of memory modules of increasing size. The Parallel Memory Hierarchy (PMH) model [4] extended this abstraction to parallel architectures by modeling machines as trees of memories. Historically, interest in non-uniform memory access models has been motivated by the analysis of algorithm performance [22, 29].


Instead, we view hierarchical memory as a fundamental aspect of our programming model required to achieve both performance and portability across a wide range of architectures. Our design is heavily influenced by the idea of space-limited procedures [5], an approach for programming machines modeled using the PMH model. A task is a more abstract concept than a space-limited procedure (the similarity to a task instance is greater), but key ideas (and terminology) regarding variants and task specialization are adopted from Alpern et al.'s work.

The Chameleon [6] and CHORES [16] systems also bear similarities to Sequoia in their runtime-based approach to dividing computation into hierarchies of simple operations called chores. These systems were intended to ease parallel programming on shared memory systems, and do not deal with issues such as explicitly naming working sets and address space isolation that we desire.

Sequoia tasks are a generalization of stream programming kernels [26, 8]. Tasks and kernels share similarities such as isolation, a local address space, and well-specified working sets, but differ in that tasks operate upon collections of data rather than single records and are able to nest arbitrarily. Task hierarchies facilitate explicit expression of locality that stream compilers have struggled to find via automatic analysis.

Sequoia's control flow when encountering a parallel mapping of subtasks resembles the non-SPMD style of concurrency in Cilk [7], X10 [11], and Fortress [2]. Sequoia control flow is constrained in comparison to these languages since the calling task cannot proceed until all subtasks complete (similar to OpenMP [13] loops). Cilk introduced a generic concurrency model to facilitate the implementation of sophisticated dynamic scheduling of workloads. We leverage the flexibility of non-SPMD control flow to achieve portability goals.

It is well known that divide-and-conquer strategies lead to algorithms that exhibit high levels of locality [20]. Cache-oblivious algorithms [18, 19] make provably efficient use of a machine's memory hierarchy without regard to the particular size or number of hierarchy levels. The portability of the cache-oblivious approach comes at the cost of lost constant factors in performance and the added complexity of expressing algorithms in a cache-oblivious manner. Sequoia algorithms are machine independent and written in a divide-and-conquer style, but not necessarily cache oblivious. Sequoia provides portability across machines featuring widely varying communication mechanisms and rich memory hierarchies by allowing the programmer to describe multiple address spaces, while cache-oblivious algorithms written in traditional programming languages do not.

Meta-compilation of parameterized algorithms has been used to automatically tune matrix multiplication [30] and FFT [17] libraries to machines by a heuristic-guided search through a domain-specific parameter space. Task parameterization via tunables and task variants implicitly defines a parameter space for every Sequoia application. It is not inconceivable that similar search techniques could be employed to automatically generate Sequoia mapping files for a new architecture.

8. Discussion and Future Work

The Sequoia programming model is built around the idea that improving the productivity of the performance-conscious programmer requires placing performance-critical aspects of the machine under their direct control. At the same time, to ensure portability, the separation of algorithmic expression and machine-specific tuning remains fundamental in our design. The explicit expression of communication, the movement of data through the memory hierarchy, independent computation, and the definition of working sets is achieved through a single abstraction: the task. Structuring computation as hierarchies of tasks is a general way of performing the locality-increasing optimizations already required of programmers tuning for performance, optimizations that are often performed in an ad-hoc or machine-specific manner. Sequoia's mechanisms permit these optimizations to be described in a portable way, while still allowing for aggressive tuning.

Sequoia's design is a pragmatic approach to reconciling the conflicting goals of performance and portability. We give the programmer direct control over the optimization of an algorithm to a specific machine, but isolate these decisions to a mapping specification separate from the algorithmic definition. This approach can take full advantage of available automatic tuning techniques in a compiler, yet it also allowed us to demonstrate a functioning system that requires no static analysis. Moreover, the same source code was used to run efficiently on two different platforms with minimal additional effort from the programmer.

One simple extension to our current work would be to adapt the runtime to target SMP systems as well as multicore and multithreaded processors as they become more common. With runtime or compiler support, Sequoia could provide a way to leverage the processing power available in these systems without the programmer directly using parallel mechanisms such as threads or OpenMP.

We have demonstrated our system running on the largest clusters we had at our disposal. Nothing inherent in the design of the runtime or language would prevent us from targeting larger machines, and we would like to explore the scalability of our cluster runtime on much larger systems, potentially targeting ASC-class systems.

A major focus of future work will be the implementation of the Sequoia runtime on exposed communication architectures like Cell and Merrimac. We have demonstrated runtime execution on two traditional architectures, a single PC and a cluster of PCs, and although the types of program transformations Sequoia encourages are important to obtaining performance on these platforms, they are critical for correctness when running on processors with software-managed memories. In addition, the lack of a mature, established programming model for the emerging class of exposed communication architectures makes Cell-based systems an attractive target upon which to test our work.

Although our current system requires manual task specialization and utilizes a dynamic runtime system to execute Sequoia code, both the process of developing Sequoia applications and the ultimate performance achieved by Sequoia code would benefit from a static compiler solution. Offline task scheduling and memory allocation algorithms should be capable of generating schedules for data movement and task execution that are superior to the decisions made under the constraints of a dynamic runtime system. Static data-flow analysis would assist in the elimination of redundant data transfers at task call/return boundaries, lifting the responsibility of data reuse detection from the runtime. Additionally, a mature compiler might automatically generate sections of machine mapping specifications, lessening the amount of manual work required to tune Sequoia applications. An effort to develop a Sequoia compiler with a focus on aggressive static task and data scheduling is currently ongoing in our group.


References

[1] A. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[2] E. Allen, D. Chase, V. Luchangco, J.-W. Maessen, S. Ryu, G. Steele, and S. Tobin-Hochstadt. The Fortress language specification, version 0.707. Technical report, Sun Microsystems, 2005.

[3] B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12(2/3):72–109, 1994.


[4] B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Proc. Programming Models for Massively Parallel Computers, 1993.

[5] B. Alpern, L. Carter, and J. Ferrante. Space-limited procedures: A methodology for portable high performance, 1995.

[6] G. A. Alverson and D. Notkin. Program structuring for effective parallel portability. IEEE Trans. Parallel Distrib. Syst., 4(9):1041–1059, 1993.

[7] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the 5th Symposium on Principles and Practice of Parallel Programming, 1995.

[8] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004.

[9] D. Callahan, B. L. Chamberlain, and H. P. Zima. The Cascade high productivity language. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 52–60. IEEE Computer Society, 2004.

[10] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. University of California-Berkeley Technical Report CCS-TR-99-157, 1999.

[11] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 519–538, New York, NY, USA, 2005. ACM Press.

[12] D. E. Culler, A. C. Arpaci-Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. A. Yelick. Parallel programming in Split-C. In Supercomputing, pages 262–273, 1993.

[13] L. Dagum and R. Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng., 5(1):46–55, 1998.

[14] W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J.-H. Ahn, J. Gummaraju, M. Erez, N. Jayasena, I. Buck, T. J. Knight, and U. J. Kapasi. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[15] S. J. Deitz, B. L. Chamberlain, and L. Snyder. Abstractions for dynamic data distribution. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 42–51. IEEE Computer Society, 2004.

[16] D. L. Eager and J. Jahorjan. Chores: Enhanced run-time support for shared-memory parallel computing. ACM Trans. Comput. Syst., 11(1):1–32, 1993.

[17] M. Frigo. A fast Fourier transform compiler. In Proc. 1999 ACM SIGPLAN Conf. on Programming Language Design and Implementation, volume 34, pages 169–180. ACM, May 1999.

[18] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.

[19] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 361–366, New York, NY, USA, 2005. ACM Press.

[20] F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Dev., 41(6):737–756, 1997.

[21] Intel. Math Kernel Library. http://www.intel.com/software/products/mkl, 2005.

[22] H. Jia-Wei and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC '81: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 326–333, New York, NY, USA, 1981. ACM Press.

[23] U. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany. The Imagine stream processor. In Proceedings 2002 IEEE International Conference on Computer Design, pages 282–288, Sept. 2002.

[24] F. Labonte, P. Mattson, I. Buck, C. Kozyrakis, and M. Horowitz. The stream virtual machine. In Proceedings of the 2004 International Conference on Parallel Architectures and Compilation Techniques, Antibes Juan-les-Pins, France, September 2004.

[25] A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In PPoPP '01: Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, pages 103–112, New York, NY, USA, 2001. ACM Press.

[26] P. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, 2002.

[27] S. McPeak and D. Wilkerson. Elsa: The Elkhound-based C/C++ parser. http://www.cs.berkeley.edu/~smcpeak/elkhound, 2005.

[28] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation Cell processor. In IEEE International Solid-State Circuits Conference, 2005.

[29] J. S. Vitter. External memory algorithms. pages 359–416, 2002.

[30] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.

[31] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. In ACM 1998 Workshop on Java for High-Performance Network Computing, New York, NY, USA, 1998. ACM Press.

