Thesis Proposal: Scheduling Parallel Functional Programs

Daniel Spoonhower

Committee:
Guy E. Blelloch (co-chair)
Robert Harper (co-chair)
Phillip B. Gibbons
Simon L. Peyton Jones (Microsoft Research)

(Revised June 13, 2007)

Abstract

Parallelism abounds! To continue to improve performance, programmers must use parallel algorithms to take advantage of multi-core and other parallel architectures. Existing declarative languages allow programmers to express these parallel algorithms concisely. With a deterministic semantics, a declarative language also allows programmers to reason about the correctness of programs independently of the language implementation. Despite this, the performance of these programs still relies heavily on the language implementation and especially on the choice of scheduling policy. In this thesis, I propose to use a cost semantics to allow programmers to reason about the performance of parallel programs and in particular about their use of space. This cost semantics also provides a specification for the language implementation. In my previous work, I have formalized several implementations, including different scheduling policies, as small-step transition semantics. Incorporating these policies into the language semantics establishes a tight link between programs, scheduling policies, and performance. Using these semantics, I have shown that in some cases, the choice of scheduling policy has an asymptotic effect on memory use. In my continuing work, I will consider extensions to my language and develop a full-scale implementation. With these, I hope to demonstrate that a declarative language is a practical way to program parallel algorithms and that my cost semantics offers an effective means to reason about their performance.

1 Introduction

The goal of this thesis is to show that declarative programming languages are an effective means to express parallel algorithms. Declarative languages relegate the low-level details of a parallel implementation to the language implementation, freeing programmers to seek out opportunities for parallel execution and to focus on ensuring program correctness. By abstracting away from the concrete aspects of the implementation and architecture, declarative languages facilitate the development of parallel programs.

Declarative programs can also be realized on a variety of parallel architectures. Understanding the behavior of programs in light of these different implementations, however, requires a clear description of the language semantics. In this thesis, I advocate for a semantics based on a deterministic model of parallel execution. Under such a semantics, a parallel program will always yield the same result regardless of the underlying architecture and language implementation. Thus parallelism is merely a means to achieve good performance and can be ignored for the purposes of reasoning about program correctness.

Functional programming provides a good foundation for expressing parallel algorithms because it offers a natural means to achieve a deterministic semantics. The lack of side effects ensures that the behavior of each parallel task can be determined independently of other tasks. The language implementation can interleave and evaluate parallel tasks in an arbitrary order without affecting the results. Despite this, the implementation of a functional programming language cannot be ignored entirely. While one can analyze the running time and space use of functional programs, subtle aspects of the language implementation can render these analyses meaningless. For example, it is well known that naïve implementations of sequential functional languages can asymptotically increase the space complexity of some programs (e.g., Shao and Appel [1994], Gustavsson and Sands [1999]).

The performance of parallel functional programs often hinges upon the scheduling policy, the mapping of parallel tasks to physical processors. In this thesis, I focus on how the scheduling policy affects the space use of parallel programs. While there is a wealth of research on different scheduling policies and their effects on memory usage (e.g., Blumofe and Leiserson [1998], Blelloch et al. [1999], Narlikar and Blelloch [1999]), I plan to study scheduling from the point of view of the language implementation. This requires a careful analysis of the compilation of functional programs as well as a close integration of the scheduling policy with the parallel semantics. There has also been significant work on semantics for high-level, parallel languages (e.g., Hudak and Anderson [1987], Roe [1991], Blelloch and Greiner [1995], Aditya et al. [1995], Greiner and Blelloch [1999]), but none of it has attempted to capture implementation details such as scheduling policy. In only one instance [Blelloch and Greiner, 1996] has this work attempted to reason about the space use of programs. In this thesis, I propose to fill this gap between scheduling policy and language semantics by designing a set of semantics that each describe the behavior of a particular policy. Furthermore, by building scheduling into the language definition, I will provide a means to reason about the performance of functional parallel programs.

1.1 Overview

Before giving my thesis statement, I briefly review several key topics relevant to this thesis.

Functional Parallel Languages In the bulk of this proposal, I consider a pure functional language, i.e. a language without side effects. This language allows a form of fork-join parallelism where the elements of pairs and arrays may be computed in parallel (a small example appears below). Due to the lack of effects, each element may be evaluated independently of any of the others. Thus every parallel execution of a program will yield the same result: choosing among different parallel implementations is a matter of performance and will not affect program correctness.

Parallel Scheduling In a parallel language such as this, there are often many more opportunities for parallel execution than can be accommodated by the underlying hardware platform. The way in which parallel tasks are prioritized can have a dramatic, and even asymptotic, effect on performance. A scheduling policy determines the priorities of parallel tasks and assigns tasks to physical resources. This assignment may be determined on-line by part of the language runtime implementation, or it may be determined off-line as part of a compiler transformation.

Cost Semantics To provide a means of comparing the space use of these different implementations, I use a cost semantics. A cost semantics is a refinement of the usual notion of a dynamic semantics that distinguishes programs based on their intensional behavior. It yields not only the result of evaluation, but also a measure of how this result is computed. My cost semantics assigns an abstract cost to each program in the source language. This cost is parameterized in such a way that it can be used to analyze the use of space under different scheduling policies.
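As a small illustration of this fork-join style (my own example, written using the proposal's notation in which the components of a pair {e1, e2} may be evaluated in parallel; it is a sketch, not part of the proposal), a divide-and-conquer sum over an index range might be written as follows.

    (* Sum f(lo) + ... + f(hi-1), evaluating the two halves in parallel. *)
    (* Assumes lo < hi; uses the proposal's parallel pair syntax {e1, e2}. *)
    fun psum f (lo, hi) =
      if hi - lo <= 1 then f lo
      else
        let
          val mid = (lo + hi) div 2
          val (a, b) = { psum f (lo, mid), psum f (mid, hi) }
        in
          a + b
        end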


Provable Implementations While this cost semantics allows the programmer to compare the behavior of different source programs, these comparisons are only meaningful if the language implementation reflects these costs. Thus a cost semantics also acts as a specification for the implementation. Any implementation that meets this specification is called a provable implementation. In this thesis proposal, I describe implementations of several scheduling policies. In each case, I prove a correspondence between the analysis of the abstract cost and the behavior of the implementation. The aim of this thesis is to show that together, these techniques can be effective tools for expressing and implementing parallel algorithms.

Thesis Statement A cost semantics provides a powerful and elegant means to reason about the use of space in parallel functional programs and to guide provably space-efficient implementations, including parallel scheduling policies.

To substantiate this thesis, I will build upon my previous work on cost semantics and scheduling. I will show how parallel algorithms can be expressed in a high-level language and how this language can be implemented efficiently. My continuing work will focus on three areas.

1. Language Extensions – While I have formalized costs and implementation for a purely functional parallel language, many algorithms require effectful language features such as mutable state. I will consider how these features can be integrated into my framework.

2. Implementation – While I have given several implementations of my language, these implementations are still relatively abstract. It remains to build a concrete implementation that executes programs on parallel hardware.

3. Applications – To demonstrate the potential of my work, I will implement several parallel algorithms in my functional language and analyze the performance of these implementations using both my cost semantics and empirical methods.

In the remainder of this proposal, I give a more detailed motivation for this thesis, a survey of related work, and an account of my previous work. This work consists of two parts: first, a cost semantics that allows a precise analysis of space use, and second, a set of language implementations that each embody a different scheduling policy. Throughout this exposition, I use a series of examples to demonstrate this analysis and illustrate the effects of scheduling on the use of space. Finally, I conclude with specific plans for the completion of this thesis.

1.2 Motivation

Parallel computing has become ubiquitous. While parallel architectures were once used predominantly in scientific simulation and other high-performance applications, parallelism is now considered to be the primary means to improve the performance of commodity microprocessors. Today’s laptops, workstations, and gaming consoles boast at least two and as many as eight cores per processor. Intel predicts that within five years they will deliver chips with 80 cores [Intel]. While it is conceivable that a handful of these cores might be occupied by different applications, the amount of available parallelism will quickly outstrip the number of applications typically run by users. To continue to scale with advances in hardware performance, every desktop application must become a parallel program. The number of applications will not continue to increase at the same rate as the number of cores and processors, but the amount of data manipulated by users certainly will: personal computers are becoming massive repositories for video, sound, photographs, communication, and other text. To keep pace with the number of processors, applications must process these data in parallel.

Three-dimensional rendering is an example of an application where parallelism has been used to achieve better performance. Graphics processors are designed precisely to take advantage of the natural parallelism of 3D rendering, and programs use this parallelism to render more pixels and more realistic images.

Modern graphics cards now offer a tremendous amount of parallel computing power, reaching into the 100s of GFLOPS [Owens et al., 2007]. While graphics processors are somewhat limited in what kinds of calculations they can perform, there is increasing interest in finding additional uses for these computational resources. To take advantage of this wide range of parallel platforms, including multi-core processors, multi-processor systems, and graphics processors, programs must be written in a platform-independent language. I claim that any such parallel language must be declarative and give only a high-level description of where parallel evaluation may occur. Once details of the input data and platform have been established at runtime, the language implementation can determine the exact mapping of parallel tasks to processors or processing elements.

While elements of the input data may be processed concurrently, it is still sensible to ask, what would the result be if these data were processed sequentially? This question is critical as it defines the means by which programmers can reason about the behavior of parallel programs. It is also a significant constraint on any parallel implementation. It stipulates that evaluation is deterministic, that every parallel implementation must yield the same result. While determinism is sufficient to reason about the correctness of a parallel application, it does not tell us anything about program performance. Nor does it allow us to compare the performance of a program with respect to different language implementations. A cost semantics provides a vehicle by which we can make these comparisons. Using my cost semantics, I have shown several examples where the choice of scheduling policy has a dramatic, and even asymptotic, effect on the space usage of parallel programs. In the remainder of this section, I present a small example and give an informal analysis of its space use. In subsequent sections, I will develop the formalisms needed to carry out this analysis more rigorously.

Example: Quicksort Consider the following implementation of quicksort where the input list is partitioned and each half is sorted in parallel. In this work, the components of a pair {e1, e2} may be evaluated in parallel.

    fun qsort xs =
      case xs of
        nil ⇒ nil
      | [x] ⇒ [x]
      | x::_ ⇒ append {qsort (filter (le x) xs), qsort (filter (gt x) xs)}

Note that this is a persistent version of quicksort: the original list is preserved and each intermediate list is freshly allocated. While this code does not necessarily represent the most efficient persistent implementation, it certainly is one reasonable possibility. As such, we would like to understand its performance across a set of language implementations. Figure 1 shows an upper bound on the use of space as a function of the input size. Each line represents a variation in the number of processors or in the scheduling policy used to prioritize parallel tasks. I will describe these policies (“depth-first” and “breadth-first”) in greater detail below. Note here only that the first two configurations consume space as a polynomial function of the input size, while the latter two require only linear space. Also, note that these plots all represent the execution of the same program, each with a different implementation of the language.

I briefly draw a connection with another important component of a language implementation, the garbage collector.
Programmers who use garbage-collected languages must consider the effect of the collector on application performance. Switching between collector algorithms or changing the configuration of a given algorithm can have a significant impact on end-to-end performance. No single collector algorithm is appropriate for all applications. In much the same way, I claim that programmers who desire to understand the performance of parallel applications must consider the effect of the scheduling policy. In the sequel, I will give examples that demonstrate that no single policy is best for all applications.
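For completeness, the quicksort example above assumes a few standard list functions; one possible set of definitions is sketched below. These are my own and are not given in the proposal; in particular, le and gt are assumed to be curried comparisons against the pivot, whose exact orientation the proposal leaves unspecified.

    (* Possible definitions of the list helpers assumed by qsort (a sketch). *)
    fun filter p nil = nil
      | filter p (x :: rest) = if p x then x :: filter p rest else filter p rest

    fun append (nil, ys) = ys
      | append (x :: rest, ys) = x :: append (rest, ys)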

Figure 1 Space Use as a Function of Input Size. This figure shows an upper bound on the space required to sort a list as a function of input size (space high-water mark versus input size, for the configurations 1 PE depth-first, 2 PE depth-first, 3 PE depth-first, and 1-3 PE breadth-first). Each configuration differs in the number of processors or the scheduling policy used to prioritize parallel tasks. These policies will be described in Section 4.

2 Related Work

Parallel Languages Interest in side effect-free parallel programming began decades ago with languages such as Val [McGraw, 1982], Id [Arvind et al., 1989], and Sisal [Feo et al., 1990]. As in the current proposal, the designers of these languages argue that a deterministic semantics is essential in writing correct parallel programs and that a language without side effects is a natural means to achieve such a semantics. Several researchers have considered speculative evaluation as a means to achieve better performance in functional programming (e.g., Hudak and Anderson [1987], Aditya et al. [1995], Blelloch and Greiner [1995], Greiner and Blelloch [1999]). Roe [1991] advocates for a form of explicit parallelism in a functional language and also gives an abstract performance analysis of functional parallel programs. While most of his work focuses on a call-by-need language, he analyzes the performance of programs in a language with a strict form of parallelism. Efficient implementations of strict, nested parallelism were considered by Blelloch and his collaborators [Blelloch, 1990, Blelloch et al., 1994]. This work was later extended to a higher-order language with a rich set of types [Chakravarty and Keller, 2000, Lechtchinsky et al., 2006]. Baker-Finch et al. [2000] formalized the semantics for the core of a lazy parallel language, GpH. This semantics is stratified into sequential and parallel components. While alternative evaluation strategies are considered within this framework (e.g., fully speculative), only a single, non-deterministic scheduler is described. In GpH, evaluation strategies [Trinder et al., 1998] allow a functional programmer to explicitly control the structure of parallel evaluation (e.g. divide-and-conquer, collection-oriented parallelism).

Cost Semantics Non-standard semantics were used by both Rosendahl [1989] and Sands [1990] to automatically derive bounds on the time required to execute programs. In both cases, the semantics yields a cost that approximates the intensional behavior of programs. Cost semantics are also used to describe provably efficient implementations of speculative parallelism [Greiner and Blelloch, 1999] and explicit data-parallelism [Blelloch and Greiner, 1996]. These cost semantics yield directed, acyclic graphs that describe the parallel dependencies, just like the computation graphs in this proposal. Execution time on a bounded parallel machine is given in terms of the (parallel) depth and (sequential) work of these graphs. In one case [Blelloch and Greiner, 1996], an upper bound on space use is also given in terms of depth and work. This work was later extended to a language with recursive product and sum types [Lechtchinsky et al., 2002]. Ennals [2004] uses a cost semantics to compare the work performed by a range of sequential evaluation strategies, from lazy to eager. Like the current proposal, he also uses cost graphs with distinguished types of edges, though his edges serve different purposes. He does not formalize the use of space by these different strategies. Gustavsson and Sands [1999] also use a cost semantics to compare the performance of sequential, call-by-need programs.

They give a semantic definition of what it means for a program transformation to be “safe for space” [Shao and Appel, 1994] and provide several laws to help prove that a given transformation does not asymptotically increase the space use of programs. To prove the soundness of these laws, they use a semantics that yields the maximum heap and stack space required for execution. In the context of a call-by-value language, Minamide [1999] showed that a CPS transformation is space efficient using a cost semantics. Jay et al. [1997] describe a static framework for reasoning about the costs of parallel execution using a monadic language. Static cost models have also been used to automatically choose a parallel implementation during compilation based on hardware performance parameters [Hammond et al., 2003] and to inform the granularity of scheduling [Loidl and Hammond, 1996, Portillo et al., 2002]. Unlike this proposal, the latter work focuses on how the size of program data structures affects parallel execution (e.g. through communication costs), rather than how different execution models affect the use of space at a given point in time.

Scheduling Within the algorithms community there has been significant research on the effect of scheduling policy on memory usage [Blumofe and Leiserson, 1998, Blelloch et al., 1999, Narlikar and Blelloch, 1999]. In that work, it has been shown that different schedules can have an exponential difference in memory usage with only two processors.

3 Cost Semantics

In this section, I will describe a cost semantics for this language. Recall that a cost semantics allows us to distinguish programs based not only on their final results, but also based on how those results are computed. Like previous work, the cost semantics in this proposal is a dynamic semantics. Thus, it yields results only for closed expressions, i.e. for a given program over a particular input. Just as in ordinary performance profiling, we must run a program over a series of inputs before we can generalize its behavior.

The dynamic nature of the cost semantics has several implications. It means that extensions to the source language are quite straightforward. For example, adding recursive functions requires only minimal changes to the theorems and proofs in this proposal. Unlike a static analysis, which must compute a fixed point of the behavior for each recursive function, the cost semantics must only account for the dynamic behavior of these functions. This amounts to tracking an additional binding, and ensuring that the space costs associated with this binding are accurately accounted for. The fact that the binding is recursive is immaterial to my analysis. While this framework can easily accommodate possibly non-terminating programs, it gives no information about individual program instances that diverge. While this is certainly a limitation of my framework, it is also a limitation of current practice using heap profilers [Runciman and Wakeling, 1993]. Furthermore, this is not a significant issue with respect to the implementation of parallel algorithms (as opposed to web servers and other interactive applications) since we are only interested in instances of program execution that actually yield results.

Consider a call-by-value functional language extended with parallel pairs. The language may easily be extended with other primitive types, recursive functions, and arrays, but I elide these for the sake of clarity. The syntax of source expressions is shown below.

    (expressions)   e  ::=  x | λx.e | e1 e2 | {e1, e2} | πi e

Remark. The static semantics of this language is completely standard. I will omit a definition of typing, and even the definitions of terms such as type safety, and I trust the reader to assume their conventional meanings.

The cost semantics is given in terms of the following semantic objects.

    (values)        v  ::=  hη; x.ei` | hv1, v2i`
    (locations)     `  ∈   L
    (environments)  η  ::=  · | η, x 7→ v

Figure 2 Profiling Cost Semantics. In addition to a result, this semantics also yields two graphs that can be used to reconstruct the cost of obtaining that result. Computation graphs g record dependencies in time while heap graphs h record dependencies among values. The substitution [η]e of the values bound in η for the variables appearing in e is defined in Appendix A.2.

    η . e ⇓ v; g; h

    (` fresh)
    ------------------------------------------------------------ (E-Fn)
    η . λx.e ⇓ hη; x.ei`; [`]; {(`, `′) | `′ ∈ locs([η](x.e))}

    (x 7→ v) ∈ η    (n fresh)
    ------------------------------------------------------------ (E-Var)
    η . x ⇓ v; [n]; (n, loc(v))

    η1 . e1 ⇓ hη2; x.e3i`1; g1; h1    η1 . e2 ⇓ v2; g2; h2
    η2, x 7→ v2 . e3 ⇓ v3; g3; h3    (n fresh)
    ------------------------------------------------------------ (E-App)
    η1 . e1 e2 ⇓ v3; g1 ⊕ g2 ⊕ [n] ⊕ g3; h1 ∪ h2 ∪ h3 ∪ {(n, `1), (n, loc(v2))}

    η . e1 ⇓ v1; g1; h1    η . e2 ⇓ v2; g2; h2    (` fresh)
    ------------------------------------------------------------ (E-Pair)
    η . {e1, e2} ⇓ hv1, v2i`; g1 ⊗ g2 ⊕ [`]; h1 ∪ h2 ∪ {(`, loc(v1)), (`, loc(v2))}

    η . e ⇓ hv1, v2i`; g; h    (n fresh)
    ------------------------------------------------------------ (E-Proji)
    η . πi e ⇓ vi; g ⊕ [n]; h ∪ {(n, `)}

In this semantics, I maintain a distinction between expressions and values. Values are also annotated with locations so that sharing can be made explicit without resorting to an explicit heap: the syntax distinguishes between a pair whose components occupy the same space and one whose components do not. This will allow us to draw a tight connection between this semantics and the behavior of an implementation, in particular, with respect to its use of space.

The cost semantics (Figure 2) is an evaluation semantics that computes both the result of the computation and an abstract cost reflecting how the result was obtained. While the semantics is sequential, the cost will allow us to reconstruct different parallel schedules and reason about the space use of programs executing with these schedules. The judgment η . e ⇓ v; g; h is read, in environment η, expression e evaluates to value v with computation graph g and heap graph h. The extensional portions of this judgment are completely standard in the way they relate expressions to values. As discussed below, edges in a computation graph represent control dependencies in the execution of a program, while edges in a heap graph represent dependencies on and between values.

Computation Graphs The first part of the cost associated with each program, the computation graph, is a directed, acyclic graph. Each node in the computation graph represents the evaluation of a sub-expression, and edges represent dependencies between sub-expressions. Edges in the computation graph point forward in time: an edge from node n1 to node n2 indicates that n1 must be executed before n2. Each computation graph has one source node (with in-degree zero) and one sink node (with out-degree zero), i.e. computation graphs are directed series-parallel graphs. Each such graph consists of a single node, or of the sequential or parallel composition of smaller graphs. Nodes are denoted ` and n (and variants). Graphs are written as tuples such as (ns; ne; E) where ns is the source or start node, ne is the sink or end node, and E is a list of edges. The remaining nodes of the graph are implicitly defined by the edge list. Single-node graphs and graph operations are defined below. In these diagrams, nodes are represented as circles while arbitrary graphs are represented as diamonds. Time flows downward.

    Single Node:           [n] = (n; n; ·)

    Serial Composition:    (ns; ne; E) ⊕ (n′s; n′e; E′) = (ns; n′e; E, E′, (ne, n′s))

    Parallel Composition:  (ns; ne; E) ⊗ (n′s; n′e; E′) = (n; n′; E, E′, (n, ns), (n, n′s), (ne, n′), (n′e, n′))    (n, n′ fresh)

Heap Graphs The second part of the cost associated with each program, the heap graph, is also a directed, acyclic graph. Unlike computation graphs, heap graphs do not have distinguished start or end nodes. Each node again represents the evaluation of a sub-expression. While edges in the computation graph point forward in time, edges in the heap graph point backward in time. Edges represent a dependency on a value: if there is an edge from n to ` then n depends on the value at location `. It follows that any space associated with ` cannot be reclaimed until after n has executed and any space associated with n has also been reclaimed. Each heap graph shares nodes with the computation graph arising from the same execution. In a sense, computation and heap graphs may be considered as two sets of edges on a shared set of nodes. As above, the nodes of heap graphs are left implicit. Edges in the heap graph record both the dependencies among values as well as dependencies on values by other parts of the program state. As an example of the first case, in the evaluation rule for pairs, two edges are added to the heap graph to represent the dependencies of the pair on each of its components. Thus, if the pair is reachable, so is each component. In the evaluation of a function application, however, two edges are added to express the use of values. The first such edge marks a use of the function. The second edge denotes a possible last use of the argument. For strict functions, this second edge is redundant: there will be another edge leading to the argument when it is used. However, for non-strict functions, this is the first point at which the garbage collector might reclaim the space associated with the argument.

Consider the rule describing the evaluation of pairs, E-Pair. The cost graphs for this rule are described below.

Arrows in the computation graph point downward. This graph consists of two subgraphs (one for each component) and three additional nodes. The first node, at the top, represents the cost of forking a new parallel computation, and the second node, in the middle, represents the cost of joining these parallel threads. The final node represents the cost of the allocation of the pair. There are two heap edges (pointing upward and in bold) shown in the graph, representing the dependency of the pair on each of its components. Note that these components need not have been allocated as the final step in either sub-graph.
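As a small worked instance of these operations (my own illustration; the node names are arbitrary), suppose the two components of the pair produce graphs g1 = (a; b; E1) and g2 = (c; d; E2). Then the computation graph built by E-Pair is

    g1 ⊗ g2          =  (n; n′; E1, E2, (n, a), (n, c), (b, n′), (d, n′))            (n, n′ fresh)
    (g1 ⊗ g2) ⊕ [`]  =  (n; `; E1, E2, (n, a), (n, c), (b, n′), (d, n′), (n′, `))

so the fork node n precedes both subgraphs, the join node n′ follows both, and the allocation node ` for the pair comes last: exactly the three additional nodes described above.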


3.1 Schedules

Together, the computation and heap graphs allow a programmer to analyze the behavior of her program under a variety of hardware and scheduling configurations. A key component of this analysis is the notion of a schedule. Each schedule describes one possible parallel execution of the program and records which parallel tasks are executed at each time step. Every schedule must obey the constraints described by the computation graph g.

Definition (Schedule). A schedule of a graph g = (ns; ne; E) is a sequence of sets of nodes N0, . . . , Nk such that ns ∉ N0, ne ∈ Nk, and for all i ∈ [0, k),
• Ni ⊆ Ni+1, and
• for all n ∈ Ni+1, predg(n) ⊆ Ni,
where predg(n) is the set of nodes n′ such that there is an edge in g of the form (n′, n).

It will also be convenient to distinguish those nodes which are executed in a given time step from those that have executed in previous steps.

Definition (Executed Nodes). Given a schedule N0, . . . , Nk, the nodes executed at each step E1, . . . , Ek are defined as Ei = Ni \ Ni−1 for i ∈ [1, k].

For a schedule of a graph g, the sequence of sets of executed nodes corresponds to a pebbling [Hopcroft et al., 1977] of g.
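To make the definition concrete, the sketch below (my own, not part of the proposal) checks whether a candidate sequence of node sets is a schedule of a graph given by its start node, end node, and edge list; nodes are any equality type and sets are represented by lists.

    fun isSchedule (ns, ne, edges) sets =
      let
        fun mem x xs = List.exists (fn y => y = x) xs
        fun subset (xs, ys) = List.all (fn x => mem x ys) xs
        (* predecessors of n in the computation graph *)
        fun preds n = List.mapPartial (fn (a, b) => if b = n then SOME a else NONE) edges
        (* one step: N_i is contained in N_{i+1}, and every node in N_{i+1}
           has all of its predecessors already in N_i *)
        fun ok (ni, nip1) =
          subset (ni, nip1) andalso
          List.all (fn n => subset (preds n, ni)) nip1
        fun consecutive (x :: y :: rest) = (x, y) :: consecutive (y :: rest)
          | consecutive _ = []
      in
        not (mem ns (List.hd sets)) andalso
        mem ne (List.last sets) andalso
        List.all ok (consecutive sets)
      end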

3.2 Roots

To understand the use of space, the programmer must also account for the structure of the heap graph h. Given a schedule N0, . . . , Nk for an associated graph g, consider the moment of time represented by some Ni. Because Ni contains all previously executed nodes and because edges in h point backward in time, each edge (n, `) in h will fall into one of the following three categories.

• Both n, ` ∉ Ni. In this case, since the value associated with ` has not yet been allocated, the edge (n, `) does not contribute to the use of space at time i.

• Both n, ` ∈ Ni. While the value associated with ` has been allocated, the use of this value represented by this edge is also in the past. Again, the edge (n, `) does not contribute to the use of space at time i.

• ` ∈ Ni, but n ∉ Ni. In this case, the value associated with ` has already been allocated, and n represents one possible use in the future. Here, the edge (n, `) does contribute to the use of space at time i.

This leads us to the following definition.

Definition (Roots). The roots of a heap graph h with respect to a location ` after evaluation of the nodes in N, written roots`,h(N), is the set of nodes `′ in N where `′ = ` or h contains an edge leading from outside N to `′. Symbolically,

    roots`,h(N) = {`′ ∈ N | `′ = ` ∨ (∃n. (n, `′) ∈ h ∧ n ∉ N)}

I use the term roots to evoke a related concept from work in garbage collection. For the reader who is most comfortable thinking in terms of an implementation, the roots might correspond to those memory locations that are reachable directly from the processor registers or the call stack. This is just one possible implementation, however: the computation and heap graphs stipulate only the behavior of an implementation. This connection will be made precise in the next section.
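Continuing in the same vein, the roots can be computed directly from this definition; the following sketch (again my own, not the proposal's implementation) takes the result location, the heap graph as an edge list, and the set N of executed nodes.

    (* roots_{l,h}(N): the nodes l' in N that are the result location itself,
       or that have an incoming heap edge (n, l') from some n outside N. *)
    fun roots resultLoc heapEdges bigN =
      let
        fun mem x xs = List.exists (fn y => y = x) xs
        fun isRoot l' =
          l' = resultLoc orelse
          List.exists (fn (n, l) => l = l' andalso not (mem n bigN)) heapEdges
      in
        List.filter isRoot bigN
      end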


4 Provable Implementations

While the evaluation semantics given in the previous section allows a programmer to draw conclusions about the performance of her program, these conclusions would be meaningless if the implementation of the language did not reflect the costs given by that semantics. In this section, I define several provable implementations [Blelloch and Greiner, 1996] of this language, each as a transition (small-step) semantics. The first is a non-deterministic semantics that defines all possible parallel executions. Each subsequent semantics will define the behavior of a particular scheduling algorithm. The following table gives a brief overview of all the semantics used in this proposal.

    Semantics (Figure)       Style        Judgment(s)                   Notes
    Cost (2)                 big-step     η . e ⇓ v; g; h               sequential, profiling semantics
    Primitive (3)            small-step   e −→ e′                       axioms shared among parallel implementations
    Non-deterministic (4)    small-step   e 7−nd→ e′,  d 7−nd→ d′       defines all possible parallel executions
    Depth-first (5)          small-step   e 7−df→ e′,  d 7−df→ d′       algorithmic implementation favoring left-most sub-expressions

As part of the implementation of this language, I extend the syntax to include a parallel let construct. This construct is used to denote expressions whose parallel evaluation has begun but not yet finished. Declarations within a let par may step in parallel, depending on the constraints enforced by one of the transition semantics below. Declarations and let par expressions reify a stack of expression contexts such as those that appear in many abstract machines (e.g. [Landin, 1964]). Unlike a stack, which has exactly one topmost element, there are many “leaves” in our syntax that may evaluate in parallel. We also include values within the syntax of expressions so that substitution will be well-defined. These extensions are shown below.

    (expressions)          e  ::=  . . . | let par d in e | v
    (declarations)         d  ::=  x = e | d1 and d2
    (value declarations)   δ  ::=  x = v | δ1 and δ2

As this semantics is defined using substitution, closures will always be trivial: the environment of every closure in this semantics will be empty. I use hx.ei` as an abbreviation for h·; x.ei`. To facilitate the definition of several different parallel semantics, I first factor out those parts of the semantics that are common to each variation. These primitive sequential transitions are defined by the following judgment.

    e −→ e′

This judgment represents the step taken by a single processor in one unit of time (e.g., allocating a pair, applying a function). Primitive transitions are defined by the axioms in Figure 3. These axioms limit where parallel evaluation may occur by defining the intermediate forms for the evaluation of pairs and function application. When exactly parallel evaluation occurs is defined by the scheduling semantics, as given in the remainder of this section.

4.1 Non-Deterministic Scheduling

The first implementation in this proposal is a non-deterministic (nd) transition semantics that defines all possible parallel executions. Though this semantics itself does not serve as a model for a realistic implementation, it is a useful tool in reasoning about other, more realistic, semantics. The non-deterministic semantics is defined by a pair of judgments.

    e 7−nd→ e′        d 7−nd→ d′

Figure 3 Primitive Transitions. These rules encode transitions where no parallel execution is possible. They will be used in each of the different scheduling semantics that follow in this section. The substitution of a value declaration into an expression [δ]e is defined in Appendix A.2.

    e −→ e′

    (` fresh)
    ------------------------------------------------------------ (P-Fn)
    λx.e −→ hx.ei`

    (x1, x2 fresh and e1, e2 not values)
    ------------------------------------------------------------ (P-App)
    e1 e2 −→ let par x1 = e1 in let par x2 = e2 in x1 x2

    ------------------------------------------------------------ (P-AppBeta)
    hx.ei` v2 −→ [v2/x]e

    (` fresh)
    ------------------------------------------------------------ (P-Pair)
    {v1, v2} −→ hv1, v2i`

    (x1, x2 fresh and e1, e2 not values)
    ------------------------------------------------------------ (P-Fork)
    {e1, e2} −→ let par x1 = e1 and x2 = e2 in {x1, x2}

    (x fresh and e not a value)
    ------------------------------------------------------------ (P-Proji)
    πi e −→ let par x = e in πi x

    ------------------------------------------------------------ (P-ProjiBeta)
    πi hv1, v2i` −→ vi

    ------------------------------------------------------------ (P-Join)
    let par δ in e −→ [δ]e

These judgments state that expression e takes a single parallel step to e′ and, similarly, that declaration d takes a single parallel step to d′. This semantics allows unbounded parallelism: it models execution on a parallel machine with an unbounded number of processors. It is defined by the rules in Figure 4. Most of the nd rules are straightforward. The only non-determinism lies in the application of the rule nd-Idle. In a sense, this rule is complemented by nd-Branch: the latter says that all branches may be executed in parallel, but the former allows any sub-expression to sit idle during a given parallel step.
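As a small worked illustration (my own, not from the proposal), consider evaluating the pair {λx.x, λy.y} on a machine with at least two processors. One possible sequence of nd steps is:

    {λx.x, λy.y}
      7−nd→  let par x1 = λx.x and x2 = λy.y in {x1, x2}            (nd-Prim with P-Fork)
      7−nd→  let par x1 = hx.xi`1 and x2 = hy.yi`2 in {x1, x2}      (nd-Let, nd-Branch; P-Fn in both branches)
      7−nd→  {hx.xi`1, hy.yi`2}                                     (nd-Prim with P-Join)
      7−nd→  hhx.xi`1, hy.yi`2i`3                                   (nd-Prim with P-Pair)

The second step performs two primitive transitions in a single parallel step; using nd-Idle, either branch could instead remain unchanged, which is exactly where the non-determinism lies.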

4.1.1 Extensional Behavior

Though this implementation is non-deterministic in how it schedules parallel evaluation, the result of nd evaluation will always be the same, no matter which expressions evaluate in parallel. This statement is formalized in the following theorem. (In this and other results below, we always consider equality up to the renaming of locations.)

Theorem 1 (Confluence). If e 7−nd→∗ e′ and e 7−nd→∗ e′′ then there exists an expression e′′′ such that e′ 7−nd→∗ e′′′ and e′′ 7−nd→∗ e′′′. Similarly, if d 7−nd→∗ d′ and d 7−nd→∗ d′′ then there exists a declaration d′′′ such that d′ 7−nd→∗ d′′′ and d′′ 7−nd→∗ d′′′.

Proof. As usual, this follows from the “diamond” property as shown in Lemma 1 below.

Lemma 1. If e 7−nd→ e′ and e 7−nd→ e′′, then there exists an expression e′′′ such that e′ 7−nd→ e′′′ and e′′ 7−nd→ e′′′. Similarly, if d 7−nd→ d′ and d 7−nd→ d′′, then there exists a declaration d′′′ such that d′ 7−nd→ d′′′ and d′′ 7−nd→ d′′′.

Proof. By induction on the derivations of e 7−nd→ e′ and d 7−nd→ d′.

Case nd-Idle: In this case e′ = e. Assume that e 7−nd→ e′′ was derived using rule R. Let e′′′ = e′′. Then we have e 7−nd→ e′′ (by applying R) and e′′ 7−nd→ e′′ (by nd-Idle), as required.

As all of the non-determinism in this semantics is focused in the use of the nd-Idle rule, the remaining cases follow from the immediate application of the induction hypothesis.

Figure 4 Non-Deterministic Parallel Transition Semantics. This semantics defines all possible parallel transitions of an expression, including those that take an arbitrary number of primitive steps in parallel. Parallelism is isolated within transition expressions of the form let par. Declarations step in parallel using nd-Branch. Note that expressions (or portions thereof) may remain unchanged using the rule nd-Idle.

    e 7−nd→ e′        d 7−nd→ d′

    ------------------------------------------------ (nd-Idle)
    e 7−nd→ e

    e −→ e′
    ------------------------------------------------ (nd-Prim)
    e 7−nd→ e′

    d 7−nd→ d′
    ------------------------------------------------ (nd-Let)
    let par d in e 7−nd→ let par d′ in e

    e 7−nd→ e′
    ------------------------------------------------ (nd-Leaf)
    x = e 7−nd→ x = e′

    d1 7−nd→ d1′        d2 7−nd→ d2′
    ------------------------------------------------ (nd-Branch)
    d1 and d2 7−nd→ d1′ and d2′

Before considering the intensional behavior of the parallel semantics, I prove several properties relating its behavior to that of the cost semantics. As such, I will temporarily ignore the cost graphs and write η . e ⇓ v if η . e ⇓ v; g; h for some g and h. The first such property (Completeness) states that any result obtained using the cost semantics can also be obtained using the nd implementation. To relate these results, I define an embedding of values from the cost semantics (including closures with non-empty environments) into values in the implementation semantics. This embedding substitutes away any variables bound in a closure.

    ⌞hη; x.ei`⌟ = hx.[⌞η⌟]ei`
    ⌞hv1, v2i`⌟ = h⌞v1⌟, ⌞v2⌟i`

Environments η are embedded piecewise by applying the embedding to each component value.

Theorem 2 (nd Completeness). If η . e ⇓ v then [⌞η⌟]e 7−nd→∗ ⌞v⌟.

The proof is carried out by induction on the derivation of η . e ⇓ v and is shown in Appendix B. The following theorem (Soundness) ensures that any result obtained by the implementation semantics can also be derived using the cost semantics. As my extensions to the source language given in this section represent runtime intermediate forms, I define an embedding of these new expression forms into the original syntax. Parallel let expressions are embedded by substitution, and declarations are mapped to environments.

    ⌜let par d in e⌝ = [⌜d⌝]⌜e⌝
    ⌜e⌝ = e                                   (e ≠ let par d in e)
    ⌜x = e⌝ = x 7→ ⌜e⌝
    ⌜d1 and d2⌝ = ⌜d1⌝, ⌜d2⌝

I also define a vectorized form of the evaluation relation that evaluates several expressions simultaneously. It evaluates a list of variable-expression pairs to a list of variable-value pairs, η . · ⇓ ·.

    η1 . η2 ⇓ η2′        η1 . e ⇓ v
    ---------------------------------
    η1 . η2, x 7→ e ⇓ η2′, x 7→ v

Finally, I extend the evaluation relation to relate values to themselves. (This extension requires only trivial changes to the previous theorem, since every value steps to itself in zero steps.)

    η . v ⇓ v    (E-Val)

Theorem 3 (nd Soundness). If [η]e 7−nd→∗ v, then η . ⌜e⌝ ⇓ ⌞v⌟. Similarly, if [η]d 7−nd→∗ δ, then η . ⌜d⌝ ⇓ ⌞δ⌟.

Proof. By induction on the number of steps n in the sequence of transitions.

Case 0: In this case, e = v and therefore ⌜e⌝ = v. Since every value is related to itself under the evaluation relation, the case is proved. The same applies to d and δ.

Case n > 0: Thus [η]e 7−nd→ [η′]e′ and [η′]e′ 7−nd→∗ v. Inductively, we have η′ . ⌜e′⌝ ⇓ ⌞v⌟. The remainder of this case is given by the following lemma.

Lemma 2. If [η]e 7−nd→ [η′]e′, [η′]e′ 7−nd→∗ v, and η′ . ⌜e′⌝ ⇓ ⌞v⌟, then η . ⌜e⌝ ⇓ ⌞v⌟. Similarly, if [η]d 7−nd→ [η′]d′, [η′]d′ 7−nd→∗ δ, and η′ . ⌜d′⌝ ⇓ ⌞δ⌟, then η . ⌜d⌝ ⇓ ⌞δ⌟.

The proof is carried out by induction on the derivations of e 7−nd→ e′ and d 7−nd→ d′ and is shown in Appendix B.

4.1.2 Intensional Behavior

Having considered the extensional behavior of this implementation, I now turn to its intensional behavior. As we take the semantics to define all possible parallel executions, it should be the case that any schedule we derive from the cost semantics is implemented by a sequence of parallel steps, as defined by the transition relation. This statement is made precise in the following theorem.

Theorem 4 (Cost Completeness). If e ⇓ v; g; h and N0, . . . , Nk is a schedule of g then there exists a sequence of expressions e0, . . . , ek such that e0 = e and ek = ⌞v⌟ and for all i ∈ [0, k), ei 7−nd→ ei+1 and locs(ei) ⊆ rootsv,h(Ni).

Here I write rootsv,h(N) as an abbreviation for the roots of the location of that value, or rootsloc(v),h(N). The locations of an expression locs(e) are simply the locations of any values embedded in that expression. The location of a value is the outermost location of that value. Locations of expressions and values are defined in Appendix A.1. The final condition of the theorem states that the use of space in the parallel semantics, as determined by locs(), is approximated by the measure of space in the cost graphs, as given by roots().

Theorem 5 (Cost Soundness). If e ⇓ v; g; h and e0, . . . , ek is a sequence of expressions such that e0 = e and ek = ⌞v⌟ and for all i ∈ [0, k), ei 7−nd→ ei+1, then there exists a schedule of g given by N0, . . . , Nk with locs(ei) ⊆ rootsv,h(Ni).

Both these theorems must be generalized to account for evaluation in a non-empty environment, but in both cases they are proven much like their extensional counterparts above.

4.2 Depth-First Scheduling

I now define an alternative transition semantics that is deterministic and implements a depth-first schedule. Depth-first (df) schedules, defined below, prioritize the leftmost sub-expressions of a program and always complete the evaluation of these leftmost sub-expressions before proceeding to sub-expressions on the right. The semantics in this section implements a p-depth-first (p-df) scheduler, a scheduler that uses at most p processors. As a trivial example, a left-to-right sequential evaluation is equivalent to a one-processor or 1-df schedule.

Just as we defined the non-deterministic implementation as a transition relation, we can do the same for the depth-first implementation. The p-df transition semantics is defined on configurations p; e and p; d. These configurations describe an expression or declaration together with an integer p that indicates the number of processors that have not yet been assigned a task in this parallel step. At the root of the derivation of each parallel step, p will be equal to the total number of processors. Within a derivation, p may be smaller but not less than zero. The semantics is given by the following pair of judgments.

    p; e 7−df→ p′; e′        p; d 7−df→ p′; d′

Figure 5 p-Depth-First Parallel Transition Semantics. This deterministic semantics defines a single parallel step for a left-to-right depth-first schedule using at most p processors. Configurations p; e and p; d describe expressions and declarations with p unused processors remaining in this time step.

    p; e 7−df→ p′; e′        p; d 7−df→ p′; d′

    p; d 7−df→ p′; d′
    ------------------------------------------------ (DF-Let)
    p; let par d in e 7−df→ p′; let par d′ in e

    ------------------------------------------------ (DF-None)
    0; e 7−df→ 0; e

    ------------------------------------------------ (DF-Val)
    p; v 7−df→ p; v

    e −→ e′
    ------------------------------------------------ (DF-Prim)
    p + 1; e 7−df→ p; e′

    p; e 7−df→ p′; e′
    ------------------------------------------------ (DF-Leaf)
    p; x = e 7−df→ p′; x = e′

    p; d1 7−df→ p′; d1′        p′; d2 7−df→ p′′; d2′
    ------------------------------------------------ (DF-Branch)
    p; d1 and d2 7−df→ p′′; d1′ and d2′

These judgments define a single parallel step of an expression or declaration. The first is read, given p available processors, expression e steps to expression e′ with p′ processors remaining unused. The second has an analogous meaning for declarations. The p-df transition semantics is defined by the rules given in Figure 5. Most notable is the DF-Branch rule. It states that a parallel declaration may take a parallel step if any available processors are used first on the left sub-declaration and then any remaining available processors are used on the right. Like the non-deterministic semantics above, the p-df transition semantics relies on the primitive transitions given in Figure 3. In rule DF-Prim, one processor is consumed when a primitive transition is applied.

For the df semantics, we must reset the number of available processors after each parallel step. To do so, we define a “top-level” transition judgment for df evaluation with p processors. This judgment is defined by exactly one rule, shown below. Note that the number of processors remaining idle, p′, remains unconstrained.

    p; e 7−df→ p′; e′
    -------------------
    e 7−p-df→ e′

The complete evaluation of an expression, as for the non-deterministic semantics, is given by the reflexive, transitive closure of the transition relation, written 7−p-df→∗. We now consider several properties of the df semantics. First, unlike the non-deterministic implementation, this semantics defines a particular evaluation strategy.

Theorem 6 (Determinacy of df Evaluation). If p; e 7−df→ p′; e′ and p; e 7−df→ p′′; e′′ then p′ = p′′ and e′ = e′′. Similarly, if p; d 7−df→ p′; d′ and p; d 7−df→ p′′; d′′ then p′ = p′′ and d′ = d′′.

The proof is carried out by induction on the first derivation and hinges on the following two facts: first, that DF-None and DF-Val yield the same results, and second, that in no instance can both DF-Let and DF-Prim be applied.

We can easily show the df semantics is correct with respect to the cost semantics, simply by showing that its behavior is contained within that of the non-deterministic semantics.

Theorem 7 (df Soundness). If p; e 7−df→ p′; e′ then e 7−nd→ e′. Similarly, if p; d 7−df→ p′; d′ then d 7−nd→ d′.

Proof. By induction on the derivation of p; e 7−df→ p′; e′. Cases for derivations ending with rules DF-Let, DF-Leaf, and DF-Branch follow immediately from appeals to the induction hypothesis and analogous rules in the non-deterministic semantics. DF-Prim also follows from its analogue. Rules DF-None and DF-Val are both restrictions of nd-Idle.

It follows immediately that if e 7−p-df→ e′ then e 7−nd→ e′. This result shows the benefit of defining and reasoning about a non-deterministic semantics: once we have shown the soundness of an implementation with respect to the non-deterministic semantics, we get soundness with respect to the cost semantics for free. Thus, we know there is some schedule that accurately models the behavior of the df implementation. It only remains to pick out precisely which schedule does so. To allow programmers to understand the behavior of this semantics, I define a more restricted form of schedule. For each computation graph g there is a unique p-df schedule. As shown below, these schedules precisely capture the behavior of the df implementation.

Definition (p-Depth-First Schedule). A p-depth-first schedule of a graph g is a schedule N0, . . . , Nk such that for all i ∈ [0, k),
• |Ei+1| ≤ p, and
• for all n ∈ Ni+1 and n′ ∈ Ni, n′ ≼ n,
and for any other schedule of g given by N′0, . . . , N′j that meets these constraints, n ∈ N′i ⇒ n ∈ Ni. Here n′ ≼ n if n′ appears first in the list of edges in g.

The final condition ensures that the depth-first schedule is the most aggressive schedule that respects this ordering of nodes: if it is possible to execute node n at time i (as evidenced by its membership in N′i) then any depth-first schedule must also do so.

Theorem 8 (df Completeness). If e ⇓ v; g; h and N0, . . . , Nk is a p-df schedule of g then there exists a sequence of expressions e0, . . . , ek such that e0 = e and ek = v and for all i ∈ [0, k), ei 7−p-df→ ei+1 and locs(ei) ⊆ rootsv,h(Ni).

This theorem must be generalized not only over arbitrary environments, but also over df schedules which may use a different number of processors at each time step. This allows a p-df schedule to be split into two df schedules that, when run in parallel, never use more than p processors in a given step. The proof hinges on the fact that any df schedule can be split in this fashion, and moreover, that it can be split so that the left-hand side is allocated all the processors (up to p) that it could possibly make use of.
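Operationally, the scheduling decision behind a single p-df step can be pictured with a very small piece of code. The sketch below is my own and is far simpler than a real runtime: it takes the ready tasks in their left-to-right, depth-first order and runs the first p of them, mirroring how DF-Branch hands processors to the left sub-declaration before the right.

    (* Greedy depth-first step: run the leftmost p ready tasks, keep the rest. *)
    fun depthFirstStep p ready =
      let
        fun split (0, rest) = ([], rest)
          | split (_, []) = ([], [])
          | split (n, t :: rest) =
              let val (run, wait) = split (n - 1, rest)
              in (t :: run, wait) end
      in
        split (p, ready)   (* (tasks executed this step, tasks left waiting) *)
      end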

4.3 Breadth-First Scheduling

Just as the semantics in the previous section captured the behavior corresponding to a depth-first pebbling of the computation graph, we can also give an implementation corresponding to a breadth-first (bf) pebbling. This is the most “fair” schedule in the sense that it distributes computational resources evenly across parallel tasks. For example, given four parallel tasks, a 1-bf scheduler alternates round-robin among the four. A 2-bf scheduler takes one step for each of the first two tasks in parallel, followed by one step for the second two, followed again by the first two, and so on. I omit the presentation of this semantics and only state that a theorem establishing a precise correspondence between breadth-first schedules and this implementation, similar to that shown above for the depth-first case, can also be proved.

4.3.1 Example: Quicksort

We now return to the parallel implementation of quicksort described in Section 1.2. The plot shown in Figure 1 is derived from a direct implementation of the cost semantics in Standard ML. The computation and heap graphs are automatically analyzed to determine an upper bound on the total space required for each scheduling policy. This space is determined by the number of nodes in the heap graph reachable from the roots().

Recall that the four configurations described in Figure 1 are (from top to bottom) a 1-df (i.e. sequential) schedule, a 2-df schedule, a 3-df schedule, and p-bf schedules for p ≤ 3. While the second requires less space than the first, both require space that is polynomial in the input size. The space used in the third and fourth configurations is a linear function of the input size. In all cases, we are considering the worst-case behavior for this example: sorting a list whose elements are in reverse order. The p-df schedules for p ≤ 2 require more space because they begin to partition many lists (one for each recursive call) before the first such partitions are completed. In the worst case, each of these in-progress partitions requires O(N) space, and there are N such partitions. The p-bf schedules avoid this asymptotic blowup in space use by completing all partitions at a given recursive depth before advancing to the next level of recursion. This implies that no matter how many partitions are in progress, only O(N) elements will appear in these lists. The p-df schedules for p > 2 achieve the same performance by exhausting all available parallelism and, in effect, simulating the breadth-first schedules.

Figure 6(a) shows an example of cost graphs for the quicksort example. Both the computation and heap graphs are “distilled” to reveal their essential characteristics. For example, long sequences of sequential computation are represented as a single node, a node whose size is proportional to the length of that sequence. Heap edges between two composite nodes are weighted by the size of all transitively reachable objects.

4.3.2 Example: Numerical Integration

Our second example concerns the numerical integration of the following degree-five univariate polynomial.

    f(x) = (x + 3) × (x − 10) × (2x − 20) × (3x − 24) × (4x + 15)

We approximate the integral between ±10 using the adaptive rectangle rule. This algorithm computes the area under f as a series of rectangles, using narrower rectangles in regions where f seems to be undergoing the most change. Whenever the algorithm splits an interval, the approximation of each half is computed in parallel. Note that the parallel structure of this example is determined not by the size or shape of a data structure, but instead by the behavior of the function f. Increasing the amount of available parallelism allows us to calculate better approximations in a given period of time. The code for this algorithm is shown below.

    fun integrate f (a, b) =
      let
        val mid = (a + b) / 2.0
        val xdif = b - a
        val ydif = (f b) - (f a)
      in
        if withinTolerance (xdif, ydif) then
          (* approximate *)
          (f mid) * xdif
        else
          (* divide and recur *)
          let val (l, r) = { integrate f (a, mid), integrate f (mid, b) }
          in l + r end
      end

Here withinTolerance is a predicate that determines whether a rectangular approximation is sufficiently accurate. Figure 7 shows an upper bound on the space required by this program. Each data point represents an execution with the same input but with a different number of processors. Unlike the quicksort example, the depth-first schedule requires far less space for a small number of processors. Even as we increase the number of processors, the depth-first scheduler only gradually increases its use of space.

Figure 6 Cost Graphs for Quicksort. Summarized computation and heap graphs for the examples in (a) Section 1.2 and (b) Section 4.3.3. Both represent evaluations of qsort [4,3,2,1]. Graphs are “distilled” to reveal their essential characteristics. The graph on the right shows that parallel evaluation is more restricted in the second version of quicksort (though this particular input offers relatively few opportunities for parallel evaluation to begin with).


Figure 7 Space Use as a Function of Number of Processors. This figure shows an upper bound on the space required to numerically integrate a function using the code given in the text (space high-water mark versus number of processing elements, for the breadth-first and depth-first schedulers). The breadth-first scheduler requires a nearly constant amount of space, regardless of the number of processors. The depth-first scheduler gradually increases space requirements as the number of processors increases.


In contrast, the breadth-first scheduler uses a large amount of space, independently of the number of PEs. Given enough PEs, the depth-first scheduler will eventually emulate the behavior (and the performance) of the breadth-first scheduler. In this example, the depth-first scheduler capitalizes on the fact that the result returned by this function is only a single scalar value. Thus by focusing resources on a small number of intervals and immediately aggregating the results, it makes efficient use of space. The breadth-first scheduler, on the other hand, expands many intervals before computing any sums.
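The example leaves f and withinTolerance unspecified; one possible instantiation (my own choices, for illustration only, and to be defined before integrate) is sketched below.

    (* The degree-five polynomial from the text. *)
    fun f x =
      (x + 3.0) * (x - 10.0) * (2.0 * x - 20.0) * (3.0 * x - 24.0) * (4.0 * x + 15.0)

    (* Accept a rectangle when it is very narrow or its estimated error is small;
       the particular thresholds here are arbitrary. *)
    fun withinTolerance (xdif, ydif) =
      Real.abs xdif < 0.001 orelse Real.abs (xdif * ydif) < 0.0001

    (* Approximate the integral of f between -10 and 10. *)
    val area = integrate f (~10.0, 10.0)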

4.3.3 Example: Quicksort Revisited

Recall that the poor performance of the depth-first scheduler in the implementation of quicksort above is due to the fact that the input argument xs appears in both branches of the parallel evaluation. In light of this, consider an alternative implementation where the recursive case is structured as follows.

    ...
    | x::_ ⇒ let val (ls, gs) = {filter (le x) xs, filter (gt x) xs}
             in append {qsort ls, qsort gs} end

In this version, we partition the list in parallel, but then synchronize before recursively sorting each sub-list. This version makes a better use of space under a depth-first schedule, but it does so at the cost of introducing more constraints on parallel execution. In particular, by synchronizing before the recursive call, it ensures that the depth-first scheduler will use only O(N) space. Figure 6(b) shows summarized cost graphs for this version of quicksort.

While this example shows that even in a declarative language such as ours, programmers have a measure of control over performance, it also points to a potential problem: otherwise innocuous program transformations may adversely affect space usage. In this case, program variables ls and gs are each used exactly once and are prime candidates for inlining. Inlining the definitions of ls and gs, however, produces exactly the version of the code given in Section 1.2.


I may explore, as part of my continuing work, a characterization of program transformations (such as inlining) that may be safely performed in the context of a parallel language.
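For reference, here is a self-contained Standard ML sketch of the synchronized version of quicksort described above, using an ordinary (sequential) pair in place of the parallel pair {e1, e2}. The helper par and the treatment of the pivot are my own choices and may differ in detail from the code in Section 1.2.

    (* sequential stand-in for the parallel pair {e1, e2} *)
    fun par (f, g) = (f (), g ())

    fun qsort [] : int list = []
      | qsort (x :: xs) =
          let
            (* partition in parallel, then synchronize *)
            val (ls, gs) = par (fn () => List.filter (fn y => y <= x) xs,
                                fn () => List.filter (fn y => y > x) xs)
            (* sort the two halves, again as a parallel pair *)
            val (sls, sgs) = par (fn () => qsort ls, fn () => qsort gs)
          in
            sls @ (x :: sgs)
          end

    (* e.g., qsort [4, 3, 2, 1] evaluates to [1, 2, 3, 4] *)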

5 Discussion

This section gives a brief look into the design of the cost semantics and the implications of my design choices. One might ask: are there other (useful) cost semantics for this language? Does the cost semantics reflect a particular implementation? Are there other implementations that also adhere to the specification set by the cost semantics?

Consider as an example the following alternative rule for the evaluation of function application. The premises remain the same, and the only difference from the conclusion in Figure 2 is highlighted with a rectangle.

    η1 . e1 ⇓ ⟨η2; x.e3⟩^ℓ1; g1; h1      η1 . e2 ⇓ v2; g2; h2      · · ·
    ─────────────────────────────────────────────────────────────────────────────
    η1 . e1 e2 ⇓ v3; g1 ⊕ g2 ⊕ g3 ⊕ [n]; h1 ∪ h2 ∪ h3 ∪ {(n, ℓ1), (n, loc(v2))}

This rule yields the same result as the version given in Figure 2. However, it admits more implementations. Recall that the heap edges (n, ℓ1) and (n, loc(v2)) represent possible last uses of the function and its argument, respectively. This variation of the rule delays those dependencies until after the evaluation of the function body. With this rule, one could prove the preservation of costs for implementations that preserve these values, even in the case where they are not used during the function application. In contrast, the original version of this rule requires that these values be considered for reclamation by a garbage collector as soon as the function is applied.

Suppose that an implementation stores the values of variables on a stack of activation records. Before the application, references to ⟨η2; x.e3⟩^ℓ1 and v2 will appear in the current record. If the implementation conforms to the original application rule, then it must either clear references to these values in the current record before the function is applied or inform the collector that these values are no longer live. An implementation that converts code into continuation-passing style (CPS) [Appel, 1992] and heap-allocates activation records is also constrained by this rule: it must not allocate space for these values unless the corresponding variables appear somewhere in the closure.

As this example suggests, there are many different implementations of the cost semantics. There is also leeway in the choice of the semantics itself. The goal was to find a semantics that describes a set of common implementation techniques. When the two do not align, my experience suggests that either the semantics or the implementation can be adjusted to allow them to fit together.
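To make the point about CPS-converted, heap-allocated activation records concrete, here is a minimal Standard ML sketch (my own illustration, not the thesis implementation): once the call f (x, k) is entered, the caller's continuation k mentions only what the rest of the computation needs, so it keeps neither f nor x live.

    (* a caller written in continuation-passing style; `rest` stands for the
       remainder of the computation after the call *)
    fun callInCps (f : int * (int -> unit) -> unit) (x : int) (rest : int -> unit) : unit =
      let
        (* the continuation closes over `rest` only; it does not capture f or x *)
        val k = fn result => rest (result + 1)
      in
        f (x, k)
      end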

6 Conclusion

Before giving a timetable for my remaining work, I briefly discuss several topics that will form the remainder of this thesis.

6.1 Language Extensions

The language I have considered thus far is a pure functional language, i.e., a language without side effects. One avenue of further work is to consider effectful extensions to the language, such as mutable references or arrays. This extension presents two challenges. First, my analysis of heap graphs depends on the fact that, at each step in the schedule, the set of executed heap nodes is an approximation of the heap state at that point in time. That is, my analysis assumes that heap graphs are persistent structures. In a programming language with mutable state, the semantics of a heap update must preserve enough information to reconstruct both the original heap structure and the new structure resulting from the update.

A second and perhaps more significant challenge is determining the parallel semantics of a language with effects. The semantics above relies on the fact that no thread of execution can influence the behavior of any other thread.

Thus every schedule will yield the same (extensional) result. With certain kinds of effects, the interleaving of threads may affect this result. There are several possible solutions to this second problem. We can restrict parallel evaluation to a pure fragment of the language, perhaps using a monad to stratify the language (one such interface is sketched below). This would maintain consistency with a sequential implementation. Alternatively, we can give a different semantics, for example, by prioritizing updates not based on the order in which they occur but by their position in the program. While this would not be consistent with a sequential semantics, it would still give a well-defined behavior that could be used by programmers to reason about their parallel programs independently of the scheduling policy.
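As one illustration of the first option, a monadic (or applicative) interface could confine parallel pairs to a pure fragment. The signature and the sequential reference implementation below are hypothetical sketches in Standard ML, not part of the thesis language.

    signature PURE_PAR =
    sig
      type 'a pure                                        (* effect-free computations *)
      val return : 'a -> 'a pure
      val map    : ('a -> 'b) -> 'a pure -> 'b pure
      val par    : 'a pure * 'b pure -> ('a * 'b) pure    (* the parallel pair *)
      val run    : 'a pure -> 'a                          (* only the top level runs *)
    end

    (* sequential reference implementation; a parallel one would schedule the
       two halves of `par` on different processors *)
    structure SeqPar :> PURE_PAR =
    struct
      type 'a pure   = unit -> 'a
      fun return x   = fn () => x
      fun map f c    = fn () => f (c ())
      fun par (a, b) = fn () => (a (), b ())
      fun run c      = c ()
    end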

6.2 Implementation

The implementations discussed above are still relatively abstract in that they rely on operations, such as substitution, that are not readily implemented at the level of modern hardware. Part of the remainder of my thesis will consist of a compiler and runtime system for a parallel functional language.

One possibility for a more concrete implementation is to adapt an existing sequential compiler to support a parallel language. MLton is an open-source compiler for Standard ML that uses whole-program analysis to achieve good performance [MLton]. This performance and MLton's simple design make it an attractive option for a parallel implementation. This implementation would use one of MLton's two existing backends, which generate either C or native x86 code, to target multi-core and other shared-memory multi-processor systems based on the x86 architecture.

A parallel version of MLton can be broken down roughly into two parts: compiler changes and runtime support. Changes to the compiler should be relatively simple. They include adding new primitive operations to allow synchronization among threads and access to the scheduler. I must also ensure that these primitives are handled properly by MLton's many optimizations. Changes to the runtime will be more significant. Here, I must not only implement the additional primitives required to schedule parallel tasks, but also ensure that existing runtime operations are thread-safe. For example, allocation and collection routines must properly synchronize simultaneous requests from multiple threads. To provide reasonable performance, operations such as allocation must be thread-local in the common case.

One open issue concerns the granularity of parallel tasks. It may be more efficient to occasionally execute a set of parallel tasks sequentially rather than incur the overhead of communicating with the scheduler. In this case, one implementation strategy, sketched at the end of this section, is to produce two versions of each function: one that adds parallel tasks (if any) to the queue and a second that executes them sequentially. Tasks would be added to the queue either with some fixed frequency or until parallel execution reaches a given depth.

Alternatives. In the event that an MLton implementation proves too time-consuming to fit within the scope of my thesis, several alternative implementations are possible. Both Nepal [Chakravarty and Keller, 2000] and NESL [Blelloch and Greiner, 1996] provide implementations of a flattened semantics that would allow me to test predictions made by the cost semantics. There is also an opportunity to explore alternative flattening techniques within these implementations, including implementations based on depth-first traversals of the cost graphs. Graphics processors present another opportunity for implementation. As in the case of Nepal or NESL, this would again limit me to a data-parallel implementation. However, from the point of view of the central processor, program execution is sequential. This would simplify many aspects of the implementation discussed above.
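The following Standard ML sketch illustrates the two-version strategy and the depth cutoff mentioned above on a toy example. Scheduler.fork is a hypothetical primitive (given a sequential stand-in here so that the example runs), and the cutoff value is arbitrary.

    structure Scheduler =
    struct
      (* stand-in: a real runtime would enqueue the second task for another processor *)
      fun fork (f, g) = (f (), g ())
    end

    val cutoff = 5   (* depth beyond which no new parallel tasks are created *)

    (* sequential version: never communicates with the scheduler *)
    fun fibSeq n = if n < 2 then n else fibSeq (n - 1) + fibSeq (n - 2)

    (* parallel version: spawns tasks only above the cutoff depth *)
    fun fibPar depth n =
      if depth >= cutoff orelse n < 2 then fibSeq n
      else
        let
          val (a, b) = Scheduler.fork (fn () => fibPar (depth + 1) (n - 1),
                                       fn () => fibPar (depth + 1) (n - 2))
        in
          a + b
        end

    (* e.g., fibPar 0 30 *)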

6.3 Applications

In addition to the small examples discussed in the text above (e.g., quicksort, sieve of Eratosthenes), I plan to implement one or more larger parallel programs taken from areas such as graph theory or mesh generation. While I have listed applications as the third part of my ongoing work, I plan to identify several potential applications before beginning significant work on other parts of my thesis. These applications will serve not only to demonstrate my techniques, but also to drive the extensions to my language and the implementation described in this section.

6.4 Plan of Work

I conclude with a detailed plan for the remainder of this thesis.

Language Extensions
  Impure Semantics. Determine parallel semantics for an impure language. (5%)
  Mutable State. Integrate mutable state into cost semantics. (10%)

Implementation
  Compiler Support. Extend syntax, elaborator, and internal languages; verify new primitives are treated correctly by optimizations. (10%)
  Runtime Support. Add synchronization and scheduling primitives; implement thread-safe versions of existing runtime operations (e.g., allocation, collection). (20%)
  Testing and Instrumentation. Verify correctness and performance of implementation; add appropriate instrumentation to validate predictions. (10%)

Applications
  Implementation. Implement several small as well as one or more larger examples. (15%)
  Evaluation. Analyze behavior of programs both using cost semantics and empirically. (10%)

Dissertation
  Writing. Complete the thesis document. (20%)

References

Shail Aditya, Arvind, Jan-Willem Maessen, and Lennart Augustsson. Semantics of pH: A parallel dialect of Haskell. Technical Report Computation Structures Group Memo 377-1, MIT, June 1995.

Andrew W. Appel. Compiling with Continuations. Cambridge University Press, New York, NY, USA, 1992. ISBN 0-521-41695-7.

Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4):598–632, 1989. ISSN 0164-0925.


Clem Baker-Finch, David King, and Phil Trinder. An operational semantics for parallel lazy evaluation. In ACM SIGPLAN International Conference on Functional Programming (ICFP '00), pages 162–173. ACM, 2000.

G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA, 1990.

G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing, 21(1):4–14, April 1994.

Guy Blelloch, Phil Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the Association for Computing Machinery, 46(2):281–321, 1999.

Guy E. Blelloch and John Greiner. Parallelism in sequential functional languages. In Functional Programming Languages and Computer Architecture, pages 226–237, 1995.

Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In ACM SIGPLAN International Conference on Functional Programming, pages 213–225, May 1996.

R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal of Computing, 27(1):202–229, 1998.

Manuel M. T. Chakravarty and Gabriele Keller. More types for nested data parallel programming. In ICFP '00: Proceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 94–105, New York, NY, USA, 2000. ACM Press. ISBN 1-58113-202-6.

Robert Ennals. Adaptive Evaluation of Non-Strict Programs. PhD thesis, University of Cambridge, 2004.

John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the SISAL language project. J. Parallel Distrib. Comput., 10(4):349–366, 1990. ISSN 0743-7315.

John Greiner and Guy E. Blelloch. A provably time-efficient parallel implementation of full speculation. ACM Transactions on Programming Languages and Systems, 21(2):240–285, 1999.

Jörgen Gustavsson and David Sands. A foundation for space-safe transformations of call-by-need programs. In Proceedings of the Workshop on Higher Order Operational Techniques in Semantics, volume 26 of Electronic Notes in Theoretical Computer Science, September 1999.

Kevin Hammond, Jost Berthold, and Rita Loogen. Automatic skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.

John Hopcroft, Wolfgang Paul, and Leslie Valiant. On time versus space. J. ACM, 24(2):332–337, 1977. ISSN 0004-5411.

Paul Hudak and Steve Anderson. Pomset interpretations of parallel functional programs. In Proc. of a conference on Functional programming languages and computer architecture, pages 234–256, London, UK, 1987. Springer-Verlag. ISBN 0-387-18317-5.

Intel. 80-core programmable processor first to deliver teraflops performance. URL http://www.intel.com/.

C. Barry Jay, Murray Cole, M. Sekanina, and Paul Steckler. A monadic calculus for parallel costing of a functional language of arrays. In Euro-Par '97: Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 650–661, London, UK, 1997. Springer-Verlag. ISBN 3-540-63440-1.

Peter J. Landin. The mechanical evaluation of expressions. Computer Journal, 6, January 1964.

R. Lechtchinsky, M. M. T. Chakravarty, and G. Keller. Higher Order Flattening. In V. Alexandrov, D. van Albada, P. Sloot, and J. Dongarra, editors, International Conference on Computational Science (ICCS 2006), LNCS. Springer, 2006.

Roman Lechtchinsky, Manuel M. T. Chakravarty, and Gabriele Keller. Costing nested array codes. Parallel Processing Letters, 12(2):249–266, 2002.

Hans-Wolfgang Loidl and Kevin Hammond. A sized time system for a parallel functional language. In Proceedings of the Glasgow Workshop on Functional Programming, Ullapool, Scotland, July 1996.

James R. McGraw. The VAL language: Description and analysis. ACM Trans. Program. Lang. Syst., 4(1):44–82, 1982. ISSN 0164-0925.

Yasuhiko Minamide. Space-profiling semantics of the call-by-value lambda calculus and the CPS transformation. In Andrew D. Gordon and Andrew M. Pitts, editors, The Third International Workshop on Higher Order Operational Techniques in Semantics, volume 26 of Electronic Notes in Theoretical Computer Science. Elsevier, 1999.

MLton. An open-source, whole-program, optimizing Standard ML compiler. URL http://www.mlton.org/.

G. J. Narlikar and G. E. Blelloch. Space-efficient scheduling of nested parallelism. ACM Trans. on Programming Languages and Systems, 21(1):138–173, 1999.

John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26:80–113, March 2007.

Álvaro J. Rebón Portillo, Kevin Hammond, Hans-Wolfgang Loidl, and Pedro B. Vasconcelos. Cost analysis using automatic size and time inference. In Ricardo Peña and Thomas Arts, editors, Implementation of Functional Languages, 14th International Workshop, IFL 2002, Madrid, Spain, September 16–18, 2002, Revised Selected Papers, volume 2670 of Lecture Notes in Computer Science, pages 232–248. Springer, 2002. ISBN 978-3-540-40190-2.

P. Roe. Parallel Programming Using Functional Languages. PhD thesis, Department of Computing Science, University of Glasgow, 1991.

Mads Rosendahl. Automatic complexity analysis. In FPCA '89: Proceedings of the fourth international conference on Functional programming languages and computer architecture, pages 144–156, New York, NY, USA, 1989. ACM Press. ISBN 0-89791-328-0.

Colin Runciman and David Wakeling. Heap profiling of lazy functional programs. J. Funct. Program., 3(2):217–245, 1993.

D. Sands. Calculi for Time Analysis of Functional Programs. PhD thesis, Department of Computing, Imperial College, University of London, September 1990.

Zhong Shao and Andrew W. Appel. Space-efficient closure representations. In LFP '94: Proceedings of the 1994 ACM conference on LISP and functional programming, pages 150–161, New York, NY, USA, 1994. ACM Press. ISBN 0-89791-643-3.

Philip W. Trinder, Kevin Hammond, Hans-Wolfgang Loidl, and Simon L. Peyton Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, January 1998.


A Definitions

For completeness, I give several definitions. Most are straightforward, and any interesting cases were discussed explicitly in the text above.

A.1 Locations

The location of a value is the outermost location of that value; it serves to uniquely identify that value. The locations of an expression are the locations of any values that appear in that expression, and similarly for declarations.

    loc(⟨x.e⟩^ℓ)      = ℓ
    loc(⟨v1, v2⟩^ℓ)   = ℓ

    locs(λx.e)             = locs(e)
    locs(e1 e2)            = locs(e1) ∪ locs(e2)
    locs({e1, e2})         = locs(e1) ∪ locs(e2)
    locs(πi e)             = locs(e)
    locs(let par d in e)   = locs(d) ∪ locs(e)

    locs(x = e)            = locs(e)
    locs(d1 and d2)        = locs(d1) ∪ locs(d2)

A.2 Substitution

Substitution, as used in Sections 3 and 4, is a standard capture-avoiding substitution.

    [v/x]x                  = v
    [v/x]y                  = y                          (if x ≠ y)
    [v/x]c                  = c
    [v/x](λy.e)             = λy.([v/x]e)                (if x ≠ y)
    [v/x](e1 e2)            = ([v/x]e1) ([v/x]e2)
    [v/x]{e1, e2}           = {[v/x]e1, [v/x]e2}
    [v/x](πi e)             = πi ([v/x]e)
    [v/x](let par d in e)   = let par [v/x]d in [v/x]e

    [v/x](x = e)            = (x = [v/x]e)
    [v/x](d1 and d2)        = [v/x]d1 and [v/x]d2

    [x = v]e                = [v/x]e
    [δ1 and δ2]e            = [δ1][δ2]e

    [·]e                    = e
    [η, x ↦ v]e             = [η][v/x]e
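For concreteness, the substitution equations above can be transcribed directly into Standard ML over a small expression datatype. The datatype and function names below are illustrative only; as in the equations shown, the binder cases simply stop when the bound variable coincides with x, and a full implementation would also rename binders to avoid capture.

    datatype exp =
        Var of string
      | Const of int
      | Lam of string * exp
      | App of exp * exp
      | Pair of exp * exp                 (* the parallel pair {e1, e2} *)
      | Proj of int * exp
      | LetPar of dec * exp
    and dec =
        Bind of string * exp              (* x = e *)
      | And of dec * dec                  (* d1 and d2 *)

    (* [v/x]e *)
    fun subst v x (Var y)          = if x = y then v else Var y
      | subst v x (Const c)        = Const c
      | subst v x (Lam (y, e))     = if x = y then Lam (y, e) else Lam (y, subst v x e)
      | subst v x (App (e1, e2))   = App (subst v x e1, subst v x e2)
      | subst v x (Pair (e1, e2))  = Pair (subst v x e1, subst v x e2)
      | subst v x (Proj (i, e))    = Proj (i, subst v x e)
      | subst v x (LetPar (d, e))  = LetPar (substDec v x d, subst v x e)

    (* [v/x]d *)
    and substDec v x (Bind (y, e))  = Bind (y, subst v x e)
      | substDec v x (And (d1, d2)) = And (substDec v x d1, substDec v x d2)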

B Proofs

Theorem 2 (nd Completeness). If η . e ⇓ v then [⌞η⌟]e ↦nd* ⌞v⌟.

Proof. By induction on the derivation of η . e ⇓ v.

Case E-Fn: We apply ND-Prim along with P-Fn to achieve the desired result.

Case E-Var: Since (x ↦ v) ∈ η, it follows that [⌞η⌟]e = ⌞v⌟, and ⌞v⌟ ↦nd* ⌞v⌟ in zero steps.

Case E-App: First, [⌞η⌟](e1 e2) = ([⌞η⌟]e1) ([⌞η⌟]e2). Applying ND-Prim along with P-App, we have let par x1 = [⌞η⌟]e1 in let par x2 = [⌞η⌟]e2 in x1 x2. Inductively, [⌞η⌟]e1 ↦nd* ⌞⟨η′; x.e⟩^ℓ⌟ and [⌞η⌟]e2 ↦nd* ⌞v2⌟. We apply rules ND-Let, ND-Leaf, and ND-Prim to the let par expressions at each step. We obtain the final result by application of P-Join and P-AppBeta (along with ND-Prim in both cases).

Case E-Pair: Here, [⌞η⌟]{e1, e2} = {[⌞η⌟]e1, [⌞η⌟]e2}. Applying ND-Prim along with P-Fork, we have let par x1 = [⌞η⌟]e1 and x2 = [⌞η⌟]e2 in {x1, x2}. Inductively, [⌞η⌟]e1 ↦nd* ⌞v1⌟ and [⌞η⌟]e2 ↦nd* ⌞v2⌟. We again apply rules ND-Let, ND-Branch, ND-Leaf, and ND-Prim to the let par expression at each step (also using ND-Idle in the case where the two sub-computations are of different lengths). We obtain the final result by application of P-Join and P-Pair (along with ND-Prim in both cases).

Case E-Proji: [⌞η⌟](πi e) = πi [⌞η⌟]e, and πi [⌞η⌟]e ↦nd let par x = [⌞η⌟]e in πi x by rule P-Proji with ND-Prim. Inductively, we have [⌞η⌟]e ↦nd* ⌞⟨v1, v2⟩^ℓ⌟. Applying rules ND-Let, ND-Leaf, and ND-Prim at each step, we have let par x = [⌞η⌟]e in πi x ↦nd* let par x = ⌞⟨v1, v2⟩^ℓ⌟ in πi x. We apply rules P-Join and P-ProjiBeta (along with ND-Prim in both cases) to yield the final result.

Lemma 2. If [η]e ↦nd [η′]e′, [η′]e′ ↦nd* v, and η′ . ⌜e′⌝ ⇓ ⌞v⌟, then η . ⌜e⌝ ⇓ ⌞v⌟. Similarly, if [η]d ↦nd [η′]d′, [η′]d′ ↦nd* δ, and η′ . ⌜d′⌝ ⇓ ⌞δ⌟, then η . ⌜d⌝ ⇓ ⌞δ⌟.

Proof. By induction on the derivations of e ↦nd e′ and d ↦nd d′.

Case ND-Let: We have e = let par d in e1 and e′ = let par d′ in e1, as well as a derivation of d ↦nd d′. Note that ⌜e⌝ = [⌜d⌝]⌜e1⌝ and ⌜e′⌝ = [⌜d′⌝]⌜e1⌝. It is easily shown that η′ . [⌜d′⌝]⌜e1⌝ ⇓ ⌞v⌟ if and only if η′ . ⌜d′⌝ ⇓ ⌞δ⌟ and η′ . [⌞δ⌟]⌜e1⌝ ⇓ ⌞v⌟. Inductively, we have η . ⌜d⌝ ⇓ ⌞δ⌟. Applying ND-Let yields the desired result.

Case ND-Idle: In this case, e′ = e, and therefore ⌜e′⌝ = ⌜e⌝. The required result is one of our assumptions.

Case ND-Prim: We must consider each primitive transition. P-App, P-Proji, and P-Fork follow immediately since ⌜e′⌝ = e. In the case of P-Join, ⌜e⌝ = e′. P-Fn follows by application of E-Fn. For the cases of P-Pair, P-ProjiBeta, and P-AppBeta, we apply rules E-Pair, E-Proji, and E-App (respectively), using the fact that every value is related to itself in the cost semantics.

Case ND-Leaf: We have [η]e ↦nd [η′]e′ as a sub-derivation. Inductively, we have η . ⌜e⌝ ⇓ ⌞v⌟ and the result follows immediately.

Case ND-Branch: Here, we have [η]⌜d1⌝ ↦nd [η′]d1′ and [η]⌜d2⌝ ↦nd [η′]d2′ as sub-derivations. Inductively, we have η . ⌜d1⌝ ⇓ ⌞δ1⌟ and η . ⌜d2⌝ ⇓ ⌞δ2⌟. The result follows immediately.